In [13]:
import pandas as pd
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

In [2]:
df = pd.read_csv("data/medium_articles.csv")
#df = df[0:100]
df["title"] = df["title"].astype(str)
df["text"] = df["text"].astype(str)

In [3]:
def combine_title_text(row):
    return "[TITLE]\n" + row["title"] + "\n[/TITLE]\n" + row["text"] + "\n\n"

combined_text = df.apply(combine_title_text, axis=1).str.cat(sep="")

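# Write as UTF-8 explicitly so the file reads back the same way on every platform.
with open("data/training_data.txt", "w", encoding="utf-8") as f: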
    f.write(combined_text)
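
The record layout written above can be checked on a toy example. The sketch below is a minimal, pandas-free illustration: `format_record` mirrors `combine_title_text`, while `split_records` and `RECORD_RE` are hypothetical helpers (not part of the notebook) that recover titles and bodies from the concatenated string, assuming the markers never occur inside an article.

```python
import re

def format_record(title, text):
    """Same layout as combine_title_text: title wrapped in markers, then the body."""
    return "[TITLE]\n" + title + "\n[/TITLE]\n" + text + "\n\n"

# Hypothetical helper for inspection only; assumes the markers never
# appear inside a title or body.
RECORD_RE = re.compile(
    r"\[TITLE\]\n(.*?)\n\[/TITLE\]\n(.*?)\n\n(?=\[TITLE\]\n|\Z)", re.DOTALL
)

def split_records(corpus):
    """Return a list of (title, body) pairs found in the concatenated corpus."""
    return RECORD_RE.findall(corpus)

corpus = format_record("First post", "Body one.") + format_record("Second", "Line a\n\nLine b")
records = split_records(corpus)
print(records)
```

Note that the training layout puts a newline after `[TITLE]` and before `[/TITLE]`, whereas the generation prompts later in the notebook separate the markers with spaces. Building prompts with the same function as the training records may be worth trying.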


In [4]:
tokenizer = AutoTokenizer.from_pretrained("./model/custom-gpt2-tokenizer")
dataset = TextDataset(
    tokenizer=tokenizer,
    block_size=512,
    file_path="./data/training_data.txt"
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
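
A rough sketch of what `TextDataset` is assumed to do with `block_size=512`: the whole file is tokenized as one stream and cut into consecutive blocks of exactly `block_size` tokens, and the leftover tail is dropped. The function below illustrates that chunking for a tokenizer that adds no special tokens; `chunk_token_ids` is a hypothetical name and this is not the library's code.

```python
def chunk_token_ids(token_ids, block_size):
    """Split one long token stream into consecutive full blocks.

    Trailing tokens that do not fill a whole block are dropped.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]

token_ids = list(range(1100))  # stand-in for the tokenized corpus
blocks = chunk_token_ids(token_ids, block_size=512)
print(len(blocks), [len(b) for b in blocks])  # 2 blocks of 512; 76 tokens dropped
```

Because the articles are concatenated before chunking, a single block can span the end of one article and the start of the next.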


In [5]:
output_dir = "./model/custom-gpt2-model"
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=50
)

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

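# DataLoaderConfiguration comes from accelerate and must be imported before use.
# Note: this object is not passed to the Trainer above, so it does not affect training.
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)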


In [6]:
trainer.train()
model.save_pretrained(output_dir)

100%|██████████| 171/171 [00:36<00:00,  4.71it/s]


{'train_runtime': 36.3006, 'train_samples_per_second': 18.677, 'train_steps_per_second': 4.711, 'train_loss': 5.276577062774122, 'epoch': 3.0}
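
The numbers in this log can be cross-checked against the training arguments. The sketch below is back-of-the-envelope arithmetic, assuming the reported samples per second equals (dataset size × epochs) / runtime, and assuming a single device with no gradient accumulation; neither assumption is stated in the notebook.

```python
import math

# Values copied from the training log and TrainingArguments above.
train_runtime = 36.3006
samples_per_second = 18.677
total_steps = 171
num_epochs = 3
batch_size = 4
save_steps = 50

# Assumption: samples/second = (dataset size * epochs) / runtime.
samples_seen = samples_per_second * train_runtime
blocks_per_epoch = round(samples_seen / num_epochs)

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(blocks_per_epoch / batch_size)
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(blocks_per_epoch, steps_per_epoch, steps_per_epoch * num_epochs, checkpoint_steps)
```

Under those assumptions the dataset holds about 226 blocks of 512 tokens (roughly 115,712 tokens), which gives 57 steps per epoch and matches the 171 steps in the progress bar. With `save_steps=50`, checkpoints would land at steps 50, 100, and 150.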


In [7]:
model = AutoModelForCausalLM.from_pretrained("./model/custom-gpt2-model-1024")

In [18]:
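input_text = "[TITLE] An Overview of AI [/TITLE]"
# Keep the full tokenizer output so the attention mask can be passed to generate().
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    output_sequences = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        # tokenizer.pad_token_id is unset here (see the warning recorded below),
        # so pass the model's EOS id explicitly instead of relying on the fallback.
        pad_token_id=model.config.eos_token_id,
        max_length=250,
        do_sample=True,
        top_k=30,
        # early_stopping only applies to beam search, so it is omitted.
    )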

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
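
With `do_sample=True` and `top_k=30`, each generation step samples the next token from only the 30 highest-scoring candidates. The sketch below illustrates that idea in plain Python on a made-up five-token vocabulary; `sample_top_k` is a hypothetical helper, and the real implementation operates on logit tensors rather than lists.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample one index from the k highest logits after a softmax over them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, 3.1, -1.0, 1.7]  # made-up scores for a 5-token vocabulary
draws = [sample_top_k(logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # only indices 0, 2 and 4 can appear
```

A smaller `k` makes the output more conservative, while a larger `k` admits less likely tokens and more variety.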


In [19]:
generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True) 
print(generated_text) 

[TITLE] An Overview of AI [/TITLE]

Introduction of AI (AI) and Artificial Intelligence (AI) is an emerging field of theoretical, empirical and applied research, especially in relation to the field of artificial intelligence. These fields of research involve artificial intelligence, machine learning, neuroscience, and biological sciences.

In AI studies, the field of AI is comprised of researchers who have taken part in the development of data-driven processes, including machine vision, data interpretation, cognitive neuroscience, computational neuroscience, behavioral neuroscience, and cognitive neuroscience.

The field of AI is characterized by three main areas of research:

1. A methodical approach to the analysis of data:

2. A theoretical model of AI

3. The model of AI is a theoretical model of artificial intelligence, which describes how AI might operate, or not, in accordance with the theoretical models of AI.

The most recent contribution of AI to the research and publication 

In [10]:
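input_text = "[TITLE] A Note to myself on friends[/TITLE]"
# Keep the full tokenizer output so the attention mask can be passed to generate().
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    output_sequences = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        # tokenizer.pad_token_id is unset here (see the warning recorded below),
        # so pass the model's EOS id explicitly instead of relying on the fallback.
        pad_token_id=model.config.eos_token_id,
        max_length=1000,
        do_sample=True,
        top_k=50,
        # early_stopping only applies to beam search, so it is omitted.
    )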

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [11]:
generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True) 
print(generated_text) 

[TITLE] A Note to myself on friends[/TITLE]

These are just some of the things that I learned during my time with The Mindfulness Journey. If you are interested in the life-changing experience of The Mindfulness Journey, please visit our guide to mindfulness and how-to book. If you enjoy the writing of this post, please consider following us on Instagram for daily highlights and inspiration. Please follow the blog on Facebook as well for helpful notifications when you have time.

The Mindfulness Journey

“The Mindfulness Journey“

What is mindfulness? How does it relate to human nature?

Most people confuse mindfulness with self-regulation. So the question is not how much you can maintain your mindfulness or how often you should train yourself to maintain it. It is what exactly does mindfulness do to you?

“How You Can’t Use Up All You’s Mental Energy in a Non-Heterothetical Way

If you are the kind of person who just wants to focus on things and see the things, why not have a little m

In [12]:
baseline_tokenizer = AutoTokenizer.from_pretrained("gpt2")
baseline_model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=baseline_tokenizer.eos_token_id)
baseline_model_inputs = baseline_tokenizer("[TITLE] An Overview of AI [/TITLE]", return_tensors="pt")
print(baseline_model_inputs)
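output_sequences = baseline_model.generate(
    **baseline_model_inputs,
    # The stock GPT-2 tokenizer has no pad token, so use EOS explicitly.
    pad_token_id=baseline_tokenizer.eos_token_id,
    max_length=1000,
    do_sample=True,
    top_k=50,
    # early_stopping only applies to beam search, so it is omitted.
)

print("Output:\n" + 100 * "-")
# Decode with the baseline tokenizer, which matches the baseline model's vocabulary.
print(baseline_tokenizer.decode(output_sequences[0], skip_special_tokens=True))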

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'input_ids': tensor([[   58, 49560,  2538,    60,  1052, 28578,   286,  9552, 46581, 49560,
          2538,    60]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Output:
----------------------------------------------------------------------------------------------------
[TITLE] An Overview of AI [/TITLE]

A list of things you can do with AI in Java EE.

Use AI to run the various components of your application, so that your users don't need to write code twice a day. Using AI when you're just starting out should make your code easier (see also a great article about how to use AI in JRE).

Automate building AI scripts.

Using AI in AI helps you write smarter code.

Learn about the various ways AI can help you perform your projects.

Develop, develop and test automated script-based applications.

Learn about what's in your software that could be deployed using AI using your application.

Learn about tools you have available to automate tasks in your application.

Le