pip install torch transformers sklearn pandas

For Mac M1:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

In [1]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm Elon Musk,", max_length=30, num_return_sequences=5)

  from .autonotebook import tqdm as notebook_tqdm
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm Elon Musk, I'm CEO of Intel. In my previous years working at Intel, I spent many years on the research and development"},
 {'generated_text': "Hello, I'm Elon Musk, and I'm President/CEO of Tesla. And it's so humbling. This is the kind of kind of"},
 {'generated_text': "Hello, I'm Elon Musk, the world's most valuable entrepreneur, and I wanted to hear from you about why I'm passionate about building a network"},
 {'generated_text': "Hello, I'm Elon Musk, a billionaire tech investor and investor in SpaceX and the Space Exploration Technologies Corporation. If you aren't familiar he is Tesla"},
 {'generated_text': "Hello, I'm Elon Musk, the CEO and Co-Founder of SpaceX and founder of the SpaceX Education Lab. It's not easy, but"}]

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("dataset/train_cleaned.csv")['content'].to_numpy()
train, test = train_test_split(data,test_size=0.15)
traindata = ''
testdata = ''
for i in train:
    traindata += i.replace("&amp", "") +'\n'
f = open('train_dataset.txt','w')
f.write(traindata)
for i in test:
    testdata += i.replace("&amp","") +'\n'
f = open('test_dataset.txt','w')
f.write(testdata)

170977

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
train_path = 'train_dataset.txt'
test_path = 'test_dataset.txt'

In [5]:
from transformers import LineByLineTextDataset,DataCollatorForLanguageModeling

def load_dataset(train_path,test_path,tokenizer):
    train_dataset = LineByLineTextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=32)

    test_dataset = LineByLineTextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=32)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)



In [6]:
from transformers import Trainer, TrainingArguments, AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="./gpt2-musk", #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=3, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps = 40, # Number of update steps between two evaluations.
    save_steps=80, # after # steps model is saved
    warmup_steps=50,# number of warmup steps for learning rate scheduler
    )

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    #prediction_loss_only=True,
)



In [7]:
trainer.train()

***** Running training *****
  Num examples = 14262
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1338
  0%|          | 0/1338 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [262,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [262,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [262,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: blo

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

In [10]:
trainer.save_model()

Saving model checkpoint to ./gpt2-musk
Configuration saved in ./gpt2-musk/config.json
Model weights saved in ./gpt2-musk/pytorch_model.bin


In [11]:
from transformers import pipeline

tweet = pipeline('text-generation',model='gpt2-musk', tokenizer=tokenizer )

loading configuration file gpt2-musk/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2-musk",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.20.1",
  "use_cache": true,
  "vocab_size": 50257
}

loading config

In [13]:
#generator = pipeline('text-generation', model='gpt2')
from transformers import pipeline, set_seed
set_seed(42)
tweet("With steel membrane wings like a Dragon,", max_length=50, num_return_sequences=5)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'With steel membrane wings like a Dragon, these massive wings may fly off into the distance, and then descend under pressure to rest for long periods of time. It is believed the dragon was used as a shield under heavy battle, in the early medieval period'},
 {'generated_text': 'With steel membrane wings like a Dragon, which have wing and flap halves and are folded down, can extend by 4.20 metres to support their enormous wings. It also contains large, rigid, air pockets. Its air-resistant material (BH'},
 {'generated_text': "With steel membrane wings like a Dragon, the wings are much bigger than what is shown in this picture. They cover more distance with more feathers on their feathers, and the wings are also very short compared to most dragons. But it's still hard-"},
 {'generated_text': "With steel membrane wings like a Dragon, a Dragon moves much like a Phoenix or a Phoenix's. Each wing has a small blade, and the wings spread as their movement is detected, so wh