# cGPT tips on how we should do this

Your proposed approach to fine-tuning a GPT-2 model on podcast transcripts is mostly correct. Here are the steps to follow:

Preprocess the PDF transcripts: You need to extract the text from the PDF files and remove any irrelevant information such as headers and footers. You can use a library like PyPDF2 to extract the text from the PDF files.

Convert the transcripts to a TextDataset: After extracting the text from the PDF files, you can save it in a text file and then use the TextDataset class from the Transformers library to create a dataset.

Split the dataset into train and test sets: You can use the train_test_split function from the scikit-learn library to split the dataset into a training set and a test set.

Tokenize the dataset: You need to tokenize the text data to convert it into numerical data that the model can understand. You can use the AutoTokenizer class from the Transformers library to load the GPT-2 tokenizer and tokenize the text data.

Fine-tune the GPT-2 model: You can use the Trainer class from the Transformers library to fine-tune the GPT-2 model on the podcast transcripts.
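
The header and footer removal mentioned in the preprocessing step can be approached by dropping lines that recur on most pages. The sketch below illustrates that idea only; it is not what the rest of this notebook does. The function `strip_repeated_lines`, its `min_fraction` parameter, and the sample page texts are hypothetical.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that occur on at least `min_fraction` of the pages.

    `pages` is a list of strings, one per extracted PDF page. Lines that
    repeat across most pages are assumed to be headers or footers.
    """
    # Count on how many pages each distinct non-empty line occurs.
    page_counts = Counter()
    for page in pages:
        page_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    threshold = min_fraction * len(pages)
    repeated = {line for line, count in page_counts.items() if count >= threshold}

    # Rebuild every page without the repeated lines.
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


# Hypothetical page texts standing in for the output of page.extract_text().
pages = [
    "Example Podcast Transcript\nWelcome to the show.\nexample.com",
    "Example Podcast Transcript\nToday we talk about sleep.\nexample.com",
    "Example Podcast Transcript\nThanks for listening.\nexample.com",
]
cleaned_pages = strip_repeated_lines(pages)
print(cleaned_pages)
```

A frequency threshold like this can also remove genuine content that happens to repeat (for example a recurring sponsor line), so the cleaned text is worth inspecting by hand.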

# Scraping the site for transcript PDF files

In [None]:
!pip install --upgrade transformers -q
!pip install PyPDF2 -q
!pip install huggingface_hub -q


In [None]:
import requests
from bs4 import BeautifulSoup
import os
import PyPDF2
import re
import random
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelWithLMHead,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
    pipeline,
)
from huggingface_hub import notebook_login

In [None]:
# Log in to the Hugging Face account from the Jupyter notebook
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
url = "https://readthatpodcast.com/"
response = requests.get(url)
response.raise_for_status()  # stop early if the page could not be fetched
content = response.content

In [None]:
soup = BeautifulSoup(content, "html.parser")
pdf_links = []
for link in soup.find_all("a"):
  href = link.get("href")
  if href is not None and href.endswith(".pdf"):
    pdf_links.append(href)
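
The collected `href` values may be relative, root-relative, or already absolute. A small sketch of how `urllib.parse.urljoin` from the standard library resolves each kind against the site URL; the three example `href` values are hypothetical and not taken from the actual page.

```python
from urllib.parse import urljoin

base_url = "https://readthatpodcast.com/"

# Hypothetical href values of the kinds a page might contain.
hrefs = [
    "files/episode-1.pdf",                    # relative path
    "/files/episode-2.pdf",                   # root-relative path
    "https://cdn.example.org/episode-3.pdf",  # already absolute
]

# urljoin handles all three cases without manual string checks.
resolved = [urljoin(base_url, href) for href in hrefs]
for url in resolved:
    print(url)
```

Plain string concatenation would turn the root-relative case into `https://readthatpodcast.com//files/episode-2.pdf`, with a doubled slash.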

In [None]:
from urllib.parse import urljoin

base_url = "https://readthatpodcast.com/"
for i, pdf_link in enumerate(pdf_links):
  # urljoin resolves relative links and leaves absolute ones unchanged
  pdf_link = urljoin(base_url, pdf_link)
  response = requests.get(pdf_link)
  response.raise_for_status()  # do not save an error page as a PDF
  filename = f"transcription_{i}.pdf"
  with open(filename, "wb") as f:
    f.write(response.content)
  print(f"Downloaded {filename}")

Downloaded transcription_0.pdf
Downloaded transcription_1.pdf
Downloaded transcription_2.pdf
Downloaded transcription_3.pdf
Downloaded transcription_4.pdf
Downloaded transcription_5.pdf
Downloaded transcription_6.pdf
Downloaded transcription_7.pdf
Downloaded transcription_8.pdf
Downloaded transcription_9.pdf
Downloaded transcription_10.pdf
Downloaded transcription_11.pdf
Downloaded transcription_12.pdf
Downloaded transcription_13.pdf
Downloaded transcription_14.pdf
Downloaded transcription_15.pdf
Downloaded transcription_16.pdf
Downloaded transcription_17.pdf
Downloaded transcription_18.pdf
Downloaded transcription_19.pdf
Downloaded transcription_20.pdf
Downloaded transcription_21.pdf
Downloaded transcription_22.pdf
Downloaded transcription_23.pdf
Downloaded transcription_24.pdf
Downloaded transcription_25.pdf
Downloaded transcription_26.pdf
Downloaded transcription_27.pdf
Downloaded transcription_28.pdf
Downloaded transcription_29.pdf
Downloaded transcription_30.pdf
Downloaded transcr

# Combining all the PDF files into one .txt file

In [None]:
# Build transcripts.txt, which is later split into the train/test files
text = ''
for i in range(50):  # assumes transcription_0.pdf ... transcription_49.pdf exist
    # Open each PDF in a context manager so the file handle is closed again
    with open(f'transcription_{i}.pdf', 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        # Start at index 3, i.e. skip the first three pages of every PDF
        for j in range(3, len(pdf_reader.pages)):
            page = pdf_reader.pages[j]
            text += page.extract_text() or ''

# Replace newlines with spaces and remove colons and commas
text = re.sub(r'\n', ' ', text)
text = re.sub(r'[:,]', '', text)

# Split text into sentences at ., ! and ? (the punctuation marks themselves are dropped)
sentences = re.split(r'[.!?]', text)

# Write the cleaned text to a new file named "transcripts.txt"
with open('transcripts.txt', 'w') as f:
    f.write(text)
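
Splitting with `re.split(r'[.!?]', text)` removes the sentence terminators, so text rebuilt from those sentences contains no periods, question marks, or exclamation marks. The sketch below contrasts that with a lookbehind pattern that splits on the whitespace after a terminator instead. It is an alternative for illustration, not what the cell above does, and the sample string is made up.

```python
import re

sample = "Caffeine blocks adenosine. Does it help focus? Yes! It can"

# Splitting on the punctuation itself discards the terminators.
dropped = [s.strip() for s in re.split(r"[.!?]", sample) if s.strip()]

# Splitting on the whitespace that follows a terminator keeps them.
kept = re.split(r"(?<=[.!?])\s+", sample)

print(dropped)
print(kept)
```

Neither pattern understands abbreviations or decimal numbers such as "Dr." or "2.5", so both will occasionally split in the middle of a sentence.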

In [None]:
# Load the text file
with open('transcripts.txt', 'r') as f:
    text = f.read()

# Split into sentences again so this cell does not depend on the previous one
sentences = re.split(r'[.!?]', text)

# Shuffle the sentences and split into training and test sets (80/20)
random.seed(42)
random.shuffle(sentences)
split_point = int(0.8 * len(sentences))
train_text = ' '.join(sentences[:split_point])
test_text = ' '.join(sentences[split_point:])

# Write the training and test sets to separate files
with open('train.txt', 'w') as f:
    f.write(train_text)
with open('test.txt', 'w') as f:
    f.write(test_text)
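
The shuffle-and-split step can be wrapped in a small helper, which makes the 80/20 ratio and the seed explicit. This is a sketch under the same assumptions as the cell above (sentence-level shuffle, seed 42); the function name `split_train_test` and the toy sentences are hypothetical.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=42):
    """Return (train, test) lists after a reproducible shuffle."""
    rng = random.Random(seed)   # local generator, global random state is untouched
    shuffled = list(items)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    split_point = int(train_fraction * len(shuffled))
    return shuffled[:split_point], shuffled[split_point:]


# Toy data standing in for the real sentence list.
sentences = [f"sentence {i}" for i in range(10)]
train, test = split_train_test(sentences)
print(len(train), len(test))
```

Because the sentences are shuffled before joining, neighbouring sentences in train.txt and test.txt no longer come from the same passage, which is worth keeping in mind for a language model that learns from context.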

In [None]:
with open('train.txt', 'r') as f:
    train_lines = f.read()
with open('test.txt', 'r') as f:
    test_lines = f.read()

In [None]:
train_lines



In [None]:
test_lines



This block of code is probably redundant, but I am hesitant to remove it now that everything seems to work. All it does is write train_lines and test_lines to two more files, train_text.txt and test_text.txt.

In [None]:
# train_lines and test_lines are single strings, so write() is the right call
with open('train_text.txt', 'w') as f:
    f.write(train_lines)

with open('test_text.txt', 'w') as f:
    f.write(test_lines)

In [None]:
# Read the training and test data from file
with open('train_text.txt', 'r') as f:
    train_text = f.read()

with open('test_text.txt', 'r') as f:
    test_text = f.read()

In [None]:
print("Train dataset length: "+str(len(train_text)))


Train dataset length: 6103213


In [None]:
print("Test dataset length: "+ str(len(test_text)))

Test dataset length: 1524127


In [None]:
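# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")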

# Now we can try to follow the German GPT-2 fine-tuning notebook

In [None]:
# Now that we have our data, we need to tokenize it before we can train the model on it

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

train_path = 'train_text.txt'
test_path = 'test_text.txt'


In [None]:
def load_dataset(train_path, test_path, tokenizer):
    # TextDataset tokenizes the whole file and cuts it into blocks of block_size tokens
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=32)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=32)

    # mlm=False selects causal language modeling (no masked tokens)
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (1353561 > 1024). Running this sequence through the model will result in indexing errors
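
The warning above appears because the whole training file is tokenized in one go (1353561 tokens) before it is cut into blocks of 32. The sketch below shows the chunking idea, assuming non-overlapping blocks with the trailing remainder dropped. It is a simplified stand-in, not the library's code, and `chunk_token_ids` is a hypothetical helper.

```python
def chunk_token_ids(token_ids, block_size):
    """Cut a flat list of token ids into non-overlapping blocks.

    Tokens left over at the end that do not fill a whole block are dropped.
    """
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Toy example: 100 fake token ids give three full blocks of 32.
blocks = chunk_token_ids(list(range(100)), block_size=32)
print(len(blocks), len(blocks[0]))

# Token count reported in the warning for train_text.txt.
num_tokens = 1353561
num_blocks = num_tokens // 32
print(num_blocks)
```

Under this assumption the 1353561 training tokens yield 42298 blocks, which matches the `Num examples = 42298` line in the training log. Since each block is only 32 tokens long, the model itself never sees a sequence longer than its 1024-token limit.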


In [21]:
model = AutoModelForCausalLM.from_pretrained("gpt2")  # AutoModelWithLMHead is deprecated
output_dir = "./gpt2_HubermanPodcast"

training_args = TrainingArguments(
    output_dir=output_dir, # the output directory
    overwrite_output_dir=True, # overwrite the content of the output directory
    num_train_epochs=20, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps=400, # update steps between two evaluations; only used when a step-based evaluation strategy is set
    save_steps=10000, # save a checkpoint every 10000 steps
    warmup_steps=250, # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True, # only return the loss during evaluation
    learning_rate=5e-5, # set the learning rate to 5e-5
    )


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



In [22]:
trainer.train()

***** Running training *****
  Num examples = 42298
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 26440
  Number of trainable parameters = 124439808


Step,Training Loss
500,4.3008
1000,3.9413
1500,3.8044
2000,3.6866
2500,3.6688
3000,3.5549
3500,3.5215
4000,3.509
4500,3.3965
5000,3.3993


Saving model checkpoint to ./gpt2_HubermanPodcast/checkpoint-10000
Configuration saved in ./gpt2_HubermanPodcast/checkpoint-10000/config.json
Configuration saved in ./gpt2_HubermanPodcast/checkpoint-10000/generation_config.json
Model weights saved in ./gpt2_HubermanPodcast/checkpoint-10000/pytorch_model.bin
Saving model checkpoint to ./gpt2_HubermanPodcast/checkpoint-20000
Configuration saved in ./gpt2_HubermanPodcast/checkpoint-20000/config.json
Configuration saved in ./gpt2_HubermanPodcast/checkpoint-20000/generation_config.json
Model weights saved in ./gpt2_HubermanPodcast/checkpoint-20000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=26440, training_loss=3.0293157441893794, metrics={'train_runtime': 8029.8058, 'train_samples_per_second': 105.352, 'train_steps_per_second': 3.293, 'total_flos': 1.381516296192e+16, 'train_loss': 3.0293157441893794, 'epoch': 20.0})
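
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below recomputes the step count and throughput from the logged values and converts the average training loss into a perplexity. It assumes a single device and no gradient accumulation, as the log reports.

```python
import math

# Values copied from the training log and TrainOutput above.
num_examples = 42298
batch_size = 32
num_epochs = 20
train_runtime = 8029.8058          # seconds
final_loss = 3.0293157441893794    # average training loss over the run

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime

# Perplexity is the exponential of the cross-entropy loss.
perplexity = math.exp(final_loss)

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(perplexity, 2))
```

The step count comes out at 26440, matching `Total optimization steps`. The perplexity of about 20.7 is based on the training loss averaged over the whole run, so it says little about performance on the held-out test set.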

In [None]:
trainer.save_model()

In [None]:
# max_length is a generation argument, so pass it directly instead of via config
chef = pipeline('text-generation', model=output_dir, tokenizer='gpt2', max_length=800)

In [None]:
chef('Caffeine')

In [None]:
chef('In this episode, we discuss the topic of')


In [25]:
# Push the tokenizer to the Hugging Face Hub
# (the model, trainer and dataset pushes below are left commented out)
#model.push_to_hub(output_dir)
#trainer.push_to_hub(output_dir)
#dataset.push_to_hub(output_dir)
tokenizer.push_to_hub(output_dir)

tokenizer config file saved in ./gpt2_HubermanPodcast/tokenizer_config.json
Special tokens file saved in ./gpt2_HubermanPodcast/special_tokens_map.json
Uploading the following files to Chriz94/gpt2_HubermanPodcast: tokenizer.json,special_tokens_map.json,vocab.json,merges.txt,tokenizer_config.json


CommitInfo(commit_url='https://huggingface.co/Chriz94/gpt2_HubermanPodcast/commit/59cb90dc2ccd74fb3b9b2e75922e2acfc6ddae7a', commit_message='Upload tokenizer', commit_description='', oid='59cb90dc2ccd74fb3b9b2e75922e2acfc6ddae7a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# max_length is a generation argument, so pass it directly instead of via config
chef = pipeline('text-generation', model=output_dir, tokenizer='gpt2', max_length=800)