In [26]:
import PyPDF2

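def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of the PDF, skipping its first three pages."""
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Start at index 3 so the first three pages are skipped.
        for page_num in range(3, len(reader.pages)):
            page = reader.pages[page_num]
            # Guard against pages that yield no extractable text.
            text += page.extract_text() or ""
    return text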

pdf_path = "/content/CogiMesh__Nexing__AdaptScenes__and_the_Unified_Model_Engineering_Process_UMEP.pdf"
book_text = extract_text_from_pdf(pdf_path)


In [30]:
import PyPDF2
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset, DataLoader

In [33]:
class TextDataset(Dataset):
    def __init__(self, tokenizer, text, block_size=512):
        self.tokenizer = tokenizer
        self.block_size = block_size
        self.examples = self._create_examples(text)

    def _create_examples(self, text):
        examples = []
        # Use the tokenizer passed to the constructor, not a global variable.
        tokens = self.tokenizer.encode(text, add_special_tokens=True)
        # Split tokens into full blocks of `block_size`; a trailing partial block is dropped
        for i in range(0, len(tokens) - self.block_size + 1, self.block_size):
            block = tokens[i:i+self.block_size]
            examples.append({
                'input_ids': torch.tensor(block),
                'labels': torch.tensor(block)
            })
        return examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
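
The block-splitting loop in `_create_examples` can be checked without a tokenizer. The sketch below is a minimal stand-alone version of that loop. `split_into_blocks` is a hypothetical helper name that does not appear in the notebook, and the integer list stands in for real token ids. It illustrates that tokens which do not fill a complete block are discarded.

```python
def split_into_blocks(tokens, block_size):
    """Split a token list into consecutive full blocks; a trailing partial block is dropped."""
    return [
        tokens[i:i + block_size]
        for i in range(0, len(tokens) - block_size + 1, block_size)
    ]

# Stand-in token ids (assumption: real ids would come from tokenizer.encode).
tokens = list(range(10))
blocks = split_into_blocks(tokens, block_size=4)
print(blocks)       # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(len(blocks))  # 2; tokens 8 and 9 do not fill a block and are discarded
```

With a short text and a large `block_size`, this can leave very few training examples, so it may be worth checking `len(dataset)` before training.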


In [34]:
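from torch.utils.data import DataLoader

# Load the GPT-2 tokenizer before building the dataset; no earlier cell defines it.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

dataset = TextDataset(tokenizer, book_text)
# Not used by the Trainer below, which builds its own loader from `dataset`.
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)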


In [35]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 23 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 69
 "-____-"     Number of trainable parameters = 124,439,808



TrainOutput(global_step=69, training_loss=3.143408291581748, metrics={'train_runtime': 56.1613, 'train_samples_per_second': 1.229, 'train_steps_per_second': 1.229, 'total_flos': 18029150208000.0, 'train_loss': 3.143408291581748, 'epoch': 3.0})
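
The step count and throughput in the output above follow from the run configuration. A minimal sketch of the arithmetic, using only the figures printed in the log (the variable names are illustrative and not part of the Trainer API):

```python
import math

# Values copied from the training banner and TrainOutput above.
num_examples = 23
num_epochs = 3
batch_size = 1
grad_accum = 1
train_runtime = 56.1613  # seconds

# One optimizer step consumes batch_size * grad_accum examples.
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 69

steps_per_second = total_steps / train_runtime
print(round(steps_per_second, 3))  # 1.229
```

Because the batch size is 1, samples per second and steps per second coincide, which is why both are reported as 1.229.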

In [42]:
import os
from huggingface_hub import login

# Read the access token from the environment; never hard-code a secret in a notebook.
login(token=os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [43]:
model.push_to_hub("SuruchiDS/PetersLectures")
tokenizer.push_to_hub("SuruchiDS/PetersLectures")

CommitInfo(commit_url='https://huggingface.co/SuruchiDS/PetersLectures/commit/baba63a7dd02ca062536515d51d4c70e81d6f4da', commit_message='Upload tokenizer', commit_description='', oid='baba63a7dd02ca062536515d51d4c70e81d6f4da', pr_url=None, pr_revision=None, pr_num=None)

In [63]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "SuruchiDS/PetersLectures"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

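def generate_answer(question, model, tokenizer):
    # Use the question as the prompt; it was previously ignored in favor of a hard-coded string.
    inputs = tokenizer(question, return_tensors='pt')
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # pass the mask explicitly
        pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token; reuse EOS
        max_length=200,
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # early_stopping removed: it only applies to beam search
    )
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Define the prompt in this cell instead of relying on a variable left over from another cell.
question = "What are the Future Recommendations?"
answer = generate_answer(question, model, tokenizer)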
print(answer)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What are the Future Recommendations?

The Future of the Internet
.
This is the future of our world. It is a time when we can create a world where everyone can have the tools and the knowledge to create their own unique, unique lives. We can do this by building a network of tools, networks, and platforms that enable us to do so. This is where the power of open source comes in. Open source is an open-source project, which means that it can be used, developed, deployed, or even modified. The tools we use to build and maintain open sources are open, open and open. They are tools that can help us create the best possible products, services, products that we want to deliver, to our customers, partners, customers and customers. These tools can also be deployed and deployed in production environments, where they can provide the highest quality, most efficient, cost-effective, scalable, efficient and cost effective solutions. As a result, we


In [60]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

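def generate_answer(question, model, tokenizer):
    # Use the question as the prompt; it was previously ignored in favor of a hard-coded string.
    inputs = tokenizer(question, return_tensors='pt')
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # pass the mask explicitly
        pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token; reuse EOS
        max_length=200,
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # early_stopping removed: it only applies to beam search
    )
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Same prompt the original cell hard-coded, so the output below still corresponds to it.
question = "Future Recommendations"
answer = generate_answer(question, model, tokenizer)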
print(answer)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Future Recommendations

The following recommendations are based on the recommendations of the Committee on Science, Space, and Technology of NASA.
. The Committee recommends that the United States develop a new space program that will include a manned mission to Mars. This program will be funded by the National Science Foundation.. The United Nations Space Agency (UNSAT) will develop and test a space-based space station. NASA will use the space shuttle program to develop the first manned spaceflight program. In addition, the U.S. will conduct a series of missions to explore the solar system. These missions will involve the development of a spacecraft capable of carrying humans to the moon, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto, or other celestial bodies. A manned spacecraft will also be used to study the atmospheres of other planets. Space exploration will continue to be a priority for the international community. It is important that we continue our efforts to build a safe, re