**1. Mount Google Drive to Colab Notebook**

This step is necessary to access files stored in your Google Drive directly from the Colab notebook.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**2. Load and Process the Text File**

Here we open the text file containing the data used to fine-tune the GPT-2 model. The file is read line by line, and each line is stripped of leading and trailing whitespace. Upload your .txt file to your Google Drive beforehand so that the path below resolves.


In [2]:
with open('/content/drive/My Drive/statementlarsscv3.txt', 'r') as f:
    lines=f.readlines()
lines = [line.strip() for line in lines]
print(len(lines))


2410
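Source files often contain blank lines, which would turn into empty training examples once the special tokens are added. As a minimal sketch, assuming you want to drop such lines (the helper name `clean_lines` is hypothetical and not part of the original notebook), the loading step could be extended like this:

```python
def clean_lines(raw_lines):
    """Strip surrounding whitespace and drop lines that are empty afterwards."""
    stripped = (line.strip() for line in raw_lines)
    return [line for line in stripped if line]


# Toy input standing in for the result of f.readlines()
sample = ["Valves shall be leak tested. \n", "\n", "   Tanks shall be inspected.\n"]
print(clean_lines(sample))
```

Whether filtering is appropriate depends on your data; the count printed above (2410) comes from the unfiltered file.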


**3. Install Necessary Libraries**

We need several libraries for this project, including Hugging Face's transformers and datasets, the accelerate library for mixed-precision training, and others. You also need a Hugging Face account and an access token with write permission so that the model can be pushed to your Hub.

In [3]:
!pip install transformers[torch]
!pip install accelerate -U
#!pip install transformers
!pip install datasets
!pip install --upgrade huggingface_hub
!huggingface-cli login




    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate thro

**4. Import Libraries and Load Model**

We're importing the necessary functions and classes from the installed libraries. Then we check if a GPU is available for training, and load the GPT-2 tokenizer and model.

In [6]:
import torch, os, re, pandas as pd, json
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import (
    AutoConfig, DataCollatorForLanguageModeling, DataCollatorWithPadding,
    GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments,
)


In [7]:

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [None]:
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"
device = torch.device(dev)

**5. Define Special Tokens and Modify the Tokenizer and Model**

We define three special tokens (bos: beginning of sequence, eos: end of sequence, and pad: padding) and register them with the tokenizer. Because `<|endoftext|>` is already part of GPT-2's vocabulary, only `<|EOS|>` and `<|pad|>` are actually new. The pre-trained model is then loaded with a custom configuration that carries the token ids, and its embedding matrix is resized to match the enlarged vocabulary.

In [9]:
# the eos and bos tokens are defined
bos = '<|endoftext|>'
eos = '<|EOS|>'
pad = '<|pad|>'

special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': pad}

# the new token is added to the tokenizer
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# the model config to which we add the special tokens
config = AutoConfig.from_pretrained('gpt2',
                                    bos_token_id=tokenizer.bos_token_id,
                                    eos_token_id=tokenizer.eos_token_id,
                                    pad_token_id=tokenizer.pad_token_id,
                                    output_hidden_states=False)

# the pre-trained model is loaded with the custom configuration
base_model = GPT2LMHeadModel.from_pretrained('gpt2', config=config)

# the model embedding is resized
base_model.resize_token_embeddings(len(tokenizer))


Embedding(50259, 768)
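The embedding size of 50259 reported above can be reasoned through with a toy model of the vocabulary. The sketch below is a plain-Python illustration, not the tokenizer's actual implementation; `count_new_tokens` and the placeholder vocabulary are hypothetical. It assumes that the base GPT-2 vocabulary has 50257 entries and already contains `<|endoftext|>`.

```python
def count_new_tokens(vocab, special_tokens):
    """Return the special tokens that are missing from the vocabulary."""
    return [tok for tok in special_tokens if tok not in vocab]


# Placeholder vocabulary: 50256 dummy entries plus '<|endoftext|>'
# (assumption: stands in for GPT-2's 50257-token vocabulary).
vocab = {f"tok_{i}" for i in range(50256)} | {"<|endoftext|>"}
special_tokens = ["<|endoftext|>", "<|EOS|>", "<|pad|>"]

new_tokens = count_new_tokens(vocab, special_tokens)
new_vocab_size = len(vocab) + len(new_tokens)
print(new_tokens, new_vocab_size)
```

Only the two tokens that are genuinely new need fresh embedding rows, which is consistent with the resized `Embedding(50259, 768)`.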

**6. Prepare the Data**

We create a pandas dataframe from the lines of text and wrap each line in the bos and eos tokens. We then split the data into a training set (90%) and a validation set (10%), and convert each set to a Hugging Face dataset.

In [10]:
# Build a dataframe with one text per row and wrap each text in the bos/eos tokens
df = pd.DataFrame(lines, columns=["texts"])
df["texts"] = bos + df["texts"] + eos
# Hold out 10% of the rows for validation
df_train, df_val = train_test_split(df, test_size=0.1, random_state=42)

In [11]:
df_train = df_train.reset_index(drop=True)
df_train


Unnamed: 0,texts
0,<|endoftext|>Components that can be isolated a...
1,<|endoftext|>The safe -life for pressure vesse...
2,<|endoftext|>Vent outlets shall be protected a...
3,<|endoftext|>The flaw detection capability sha...
4,<|endoftext|>The Range User responsible for t...
...,...
2164,<|endoftext|>Manual or remote valve actuators ...
2165,"<|endoftext|>In addition, procurement require..."
2166,<|endoftext|>Every pressure vessel and pressur...
2167,"<|endoftext|>After the pressure cycle testing,..."


In [12]:
df_val = df_val.reset_index(drop=True)
df_val

Unnamed: 0,texts
0,<|endoftext|>All ETS interconnections shall pr...
1,<|endoftext|>Batteries/cells shall be evaluate...
2,<|endoftext|>A material compatib ility analysi...
3,<|endoftext|>Loads combinations shall be in ac...
4,<|endoftext|>9 and shall be constructed of wr...
...,...
236,"<|endoftext|>For non - metallic lined COPVs, t..."
237,<|endoftext|>Cryogenic vessels and tanks shal...
238,<|endoftext|>EMFR power at the EED shall not e...
239,<|endoftext|>Lift trucks shall be in accordanc...


In [13]:
train_dataset=Dataset.from_pandas(df_train[["texts"]])
train_dataset
#df_val#.drop(["index"],axis=1,inplace=True)
#val_dataset=Dataset.from_pandas(df_val)
#val_dataset

Dataset({
    features: ['texts'],
    num_rows: 2169
})

In [14]:
val_dataset=Dataset.from_pandas(df_val[["texts"]])
val_dataset


Dataset({
    features: ['texts'],
    num_rows: 241
})
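The row counts above (2169 for training, 241 for validation) follow from the 10% split of 2410 lines. Here is a small sketch of that arithmetic and of the token wrapping, assuming the validation share is rounded up and the training set receives the remainder; `split_sizes` and `wrap` are hypothetical helpers written for illustration only.

```python
import math


def split_sizes(n_samples, test_size):
    """Turn a fractional test_size into (train rows, validation rows)."""
    n_val = math.ceil(n_samples * test_size)  # validation share, rounded up
    return n_samples - n_val, n_val           # training set gets the rest


def wrap(text, bos="<|endoftext|>", eos="<|EOS|>"):
    """Surround one text with the beginning- and end-of-sequence markers."""
    return bos + text + eos


n_train, n_val = split_sizes(2410, 0.1)
print(n_train, n_val)
print(wrap("Vent outlets shall be protected."))
```

Wrapping every line this way gives the model an explicit signal for where a requirement starts and where it ends.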

**7. Tokenize the Datasets**

A function to tokenize the datasets is defined, and then applied to both the training and validation datasets. This will prepare the datasets for input into our model.

In [15]:
def tokenize_function(examples, base_tokenizer=tokenizer):
    return base_tokenizer(examples['texts'], padding=True)

tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['texts'],
)
tokenized_val_dataset = val_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['texts'],
)



In [16]:
tokenizer.decode(tokenized_train_dataset['input_ids'][1])



'<|endoftext|>The safe -life for pressure vessels and pressurized structures shall be established  assuming the existence of pre -existing initial flaws or cracks in the vessel and shall cover  the maximum expected operating loads and environments. <|EOS|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|>'
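The long run of `<|pad|>` tokens in the decoded example comes from `padding=True`, which pads every sequence to the longest one in the batch handed to the tokenizer. The following is a toy re-implementation for illustration, not the library's code; the function name `pad_batch` and the token ids are made up.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since `map` is called with `batched=True`, the padded length is determined per mapped batch, so a single very long requirement can inflate the length of every example that shares its batch.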

**8. Set Up the Data Collator and Training Arguments, Initialize the Trainer and Start Training**

The data collator assembles individual tokenized samples into batches; with `mlm=False` it prepares them for causal (next-token) language modeling rather than masked language modeling. We also define the training arguments, such as the number of epochs, the batch sizes, the warmup steps, and the weight decay.

We then initialize the Trainer with the model, the training arguments, the data collator, and the datasets, and start training.
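To make the collator's role more concrete: for causal language modeling, the labels are a copy of the input ids, with padding positions replaced by -100 so that they are ignored by the loss. The sketch below mimics that behaviour in plain Python under those assumptions; `causal_lm_labels` is a hypothetical name, the token ids are invented, and the real collator operates on tensors.

```python
def causal_lm_labels(input_ids, pad_id, ignore_index=-100):
    """Copy input_ids to labels, replacing padding positions with ignore_index."""
    return [
        [ignore_index if tok == pad_id else tok for tok in seq]
        for seq in input_ids
    ]


# Toy batch: 0 plays the role of the pad token id, 9 the role of the eos token id
toy_input_ids = [[5, 6, 7, 9], [5, 9, 0, 0]]
labels = causal_lm_labels(toy_input_ids, pad_id=0)
print(labels)
```

Because the pad token here is distinct from the eos token, the eos positions keep their labels, so the model can still learn where a requirement ends. The shift by one position between inputs and targets is handled inside the model, not in the collator.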

In [17]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False                        # causal LM objective, no masked-token prediction
)
output_dir = "requirement_generator"
training_args = TrainingArguments(
    output_dir=output_dir,           # output directory
    num_train_epochs=12,             # total number of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=200,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    prediction_loss_only=True,       # return only the loss during evaluation
    save_steps=1000,                 # save a checkpoint every 1000 steps
    logging_dir='requirement_generator',  # directory for storing logs
    push_to_hub=True,                # upload the model to the Hugging Face Hub
    hub_model_id="GPT2_Fine_Tune_Requirement_Produce"  # Hub repository name (replaces the deprecated push_to_hub_model_id)
)

trainer = Trainer(
    model=base_model,                       # the instantiated Transformers model to be trained
    args=training_args,                     # training arguments, defined above
    data_collator=data_collator,            # builds causal LM batches
    train_dataset=tokenized_train_dataset,  # training dataset
    eval_dataset=tokenized_val_dataset      # evaluation dataset
)
trainer.train()

Cloning https://huggingface.co/AliChazz/GPT2_Fine_Tune_Requirement_Produce into local empty directory.





Step,Training Loss
500,8.3038


TrainOutput(global_step=816, training_loss=6.076914768592984, metrics={'train_runtime': 951.611, 'train_samples_per_second': 27.352, 'train_steps_per_second': 0.857, 'total_flos': 1859623557120000.0, 'train_loss': 6.076914768592984, 'epoch': 12.0})
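The `global_step=816` in the training output can be reproduced from the dataset size and the training arguments. This is a quick sanity check, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(n_examples, batch_size, epochs):
    """Number of optimizer steps when each batch triggers one update."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_training_steps(n_examples=2169, batch_size=32, epochs=12)
warmup_share = 200 / steps
print(steps, round(warmup_share, 3))
```

With these settings the 200 warmup steps cover roughly a quarter of training, and `save_steps=1000` exceeds the total step count, so no intermediate checkpoint is written on a step boundary. The single row in the loss table is consistent with the Trainer's default of logging every 500 steps.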

If desired, the trained model can be pushed to the Hugging Face Hub.

In [None]:
trainer.push_to_hub()
#tokenizer.save_pretrained(output_dir)
#trainer.save_model("requirement_producer")

Saving model checkpoint to requirement_generator
Configuration saved in requirement_generator/config.json
Configuration saved in requirement_generator/generation_config.json
Model weights saved in requirement_generator/pytorch_model.bin
To https://huggingface.co/AliChazz/GPT2_Fine_Tune_Requirement_Produce
   d61b80d..39e5e71  main -> main


Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}


'https://huggingface.co/AliChazz/GPT2_Fine_Tune_Requirement_Produce/commit/39e5e71ff0e9ef85bb5abe6317d076fdaaacf9a2'

Now we can generate samples from our model ⚡

In [19]:
def generate_n_text_samples(model, tokenizer, input_text, device, n_samples = 5):
    text_ids = tokenizer.encode(input_text, return_tensors = 'pt')
    text_ids = text_ids.to(device)
    model = model.to(device)

    generated_text_samples = model.generate(
        text_ids,
        max_length= 100,
        num_return_sequences= n_samples,
        no_repeat_ngram_size= 2,
        repetition_penalty= 1.5,
        top_p= 0.92,
        temperature= .85,
        do_sample= True,
        top_k= 125,
        early_stopping= True
    )
    gen_text = []
    for t in generated_text_samples:
        text = tokenizer.decode(t, skip_special_tokens=True)
        gen_text.append(text)

    return gen_text
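The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below is a toy illustration on a hand-written distribution, not the library's implementation: it keeps the `top_k` most likely tokens, renormalizes, then keeps the smallest prefix whose cumulative probability reaches `top_p`. Temperature and the repetition penalty are left out, and the function name and probabilities are invented.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Apply top-k, then nucleus (top-p) filtering to a token->probability dict."""
    # Step 1: keep the top_k most likely tokens and renormalize
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(tok, p / mass) for tok, p in ranked]

    # Step 2: keep tokens until their cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution again
    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}


toy_probs = {"shall": 0.5, "must": 0.3, "may": 0.15, "will": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)
```

With `top_k=125` and `top_p=0.92`, as used in `generate_n_text_samples`, the nucleus cut is usually the tighter constraint whenever the model is confident, while `top_k` caps the candidate set when the distribution is flat.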

In [20]:
# trained model loading
output_dir = "requirement_generator"
model_headlines_path = "requirement_generator"


headlines_model = GPT2LMHeadModel.from_pretrained(model_headlines_path)
headlines_tokenizer = GPT2Tokenizer.from_pretrained(model_headlines_path)

device = "cuda:0" if torch.cuda.is_available() else "cpu"

input_text = headlines_tokenizer.bos_token+ "Propulsion System"



In [21]:
headlines = generate_n_text_samples(headlines_model, headlines_tokenizer,
                                    input_text, device, n_samples = 100)
for h in headlines:
    print(h)
    print()



Propulsion System design shall use leak tests to ensure that no  failure of the system fluids can create a hazard.

Propulsion System Components shall be certified IAW the requirements of NFPA 70, Article  501 and use a corrosion detection system.

Propulsion System Components shall demonstrate that the test  parameter(s) is less than 1/10 of full rated load and pressure.

Propulsion System Components shall be leak tested.

Propulsion System Requirements  shall be reviewed and approved by Range Safety.

Propulsion System Components, PSCs, and SSPPs shall be listed by the appropriate  classification authority.

Propulsion System Components and UG -134 components shall be  designed with an acceptance test expected of 5.

Propulsion System Components and Support Equipment for the  Launch Pad or Spacecraft that is normally tethered to a hypergolic propellant system shall be designed, fabricated using commonly applicable launch -related engineering concepts such as cryogenic fuel cell e nst

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 241
  Batch size = 16


{'eval_loss': 3.153956174850464,
 'eval_runtime': 1.5999,
 'eval_samples_per_second': 150.635,
 'eval_steps_per_second': 10.001,
 'epoch': 12.0}
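The evaluation loss is easier to interpret as perplexity, which is the exponential of the mean per-token cross-entropy. A short sketch, assuming `eval_loss` is that mean measured in nats (the usual case for a causal language modeling loss):

```python
import math

eval_loss = 3.153956174850464  # value reported by trainer.evaluate() above
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")
```

A perplexity in the low twenties means that, on average, the model is about as uncertain as if it were choosing uniformly among roughly that many tokens at each position. The number is only comparable across models that share the same tokenizer.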

In [None]:
base_model.num_parameters()