<a href="https://colab.research.google.com/github/Muhammad-Taufiq-Khan/NLP-GPT2/blob/main/GPT_2_model_Taufiq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install dependencies

In [1]:
# Install the transformers library from Hugging Face
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.4


# Import dependencies

In [2]:
# Dependencies for preprocessing
import re
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
# Dependencies for model
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [4]:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)

Device: cuda


# Load the GPT-2 model and GPT-2 tokenizer

In [5]:
# Set up the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)


# Fetch dataset from GitHub

In [6]:
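# Fetch the training data from GitHub and load it into memory
!wget https://raw.githubusercontent.com/Muhammad-Taufiq-Khan/NLP-GPT2/main/train.txt
# wget saves train.txt to the current working directory (/content on Colab)
with open('train.txt', 'r', encoding='utf-8') as f:
    data = f.read()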

--2023-04-03 21:28:58--  https://raw.githubusercontent.com/Muhammad-Taufiq-Khan/NLP-GPT2/main/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1620295 (1.5M) [text/plain]
Saving to: ‘train.txt’


2023-04-03 21:28:58 (27.0 MB/s) - ‘train.txt’ saved [1620295/1620295]



# Preprocess training data

In [7]:
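# Preprocess the training data
def preprocessing(text):
    # drop every character that is not a word character, whitespace, or basic punctuation (. , ! ?)
    text = re.sub(r'[^\w\s.,!?]+', '', text)
    # collapse all whitespace runs (newlines and tabs included) into single spaces, then lowercase
    text = re.sub(r'\s+', ' ', text).strip().lower()
    # split the cleaned text into sentences
    sentences = sent_tokenize(text)
    return sentences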
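
To see what the two regular-expression steps in `preprocessing` do, here is a minimal sketch that applies the same cleaning to a made-up sample string. It leaves out `sent_tokenize` so that it runs without the NLTK data. The name `clean_text` and the sample text are illustrative and not part of the notebook. The pattern is equivalent to the one used above, since `\d` is already covered by `\w`.

```python
import re


def clean_text(text: str) -> str:
    """Apply the cleaning steps of `preprocessing`, without sentence splitting."""
    # Remove everything except word characters, whitespace, and . , ! ?
    text = re.sub(r'[^\w\s.,!?]+', '', text)
    # Collapse whitespace runs (tabs, newlines, repeated spaces) and lowercase
    return re.sub(r'\s+', ' ', text).strip().lower()


sample = "Hello,\tWorld!  This is\n(a) test #1."
print(clean_text(sample))  # hello, world! this is a test 1.
```

Note that quotation marks and apostrophes are not in the allowed set, so they are stripped from the training text as well.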

# Fine-tune the GPT-2 model based on preprocessed data

In [8]:
def fine_tune(sentences):
    # uses the global tokenizer and model loaded in the cell above

    # GPT-2 has no padding token, so reuse the end-of-sequence token for padding
    tokenizer.pad_token = tokenizer.eos_token

    # encode the text and prepare for fine-tuning
    # note: truncation=True cuts the joined text to the model's maximum length
    # (1024 tokens for GPT-2), so only the beginning of the dataset is used
    encoded_text = tokenizer('\n'.join(sentences), padding=True, truncation=True, return_tensors="pt").to(device)  # move tensors to the selected device
    input_ids = encoded_text['input_ids']
    attention_mask = encoded_text['attention_mask']

    # fine-tune the model: each iteration is one optimizer step on the same encoded sequence
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    for i in range(100):
        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"------> epoch {i} complete")

    # save the fine-tuned model and tokenizer
    model.save_pretrained('./fine_tuned_model')
    tokenizer.save_pretrained('./fine_tuned_model')
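
Because `fine_tune` encodes all sentences as one joined string with `truncation=True`, the tokenizer keeps only the first block of tokens (GPT-2's limit is 1024). One possible way to use more of the dataset is to split the full token sequence into consecutive blocks and train on each block in turn. The sketch below shows only the splitting step in plain Python. `chunk_token_ids` and `block_size` are hypothetical names that do not appear in the notebook, and the integers stand in for real token ids.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], block_size: int = 1024) -> List[List[int]]:
    """Split one long token sequence into consecutive blocks of at most block_size tokens."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Step through the sequence block_size tokens at a time; the last block may be shorter
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]


# Toy example: 10 placeholder token ids split into blocks of 4
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each block could then be passed to the model as its own `input_ids` tensor. The training loop would need a second, inner loop over the blocks, which is left out here.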



# Text generation

In [9]:
def generate(prefix, max_length, top_k):
    tokenizer = GPT2Tokenizer.from_pretrained('./fine_tuned_model')
    model = GPT2LMHeadModel.from_pretrained('./fine_tuned_model').to(device)
    # encode the prompt; the tokenizer also returns a matching attention mask
    encoded = tokenizer(prefix, return_tensors="pt").to(device)
    model.eval()
    with torch.no_grad():
        output_ids = model.generate(
            encoded['input_ids'],
            attention_mask=encoded['attention_mask'],
            do_sample=True,  # sample instead of greedy decoding
            max_length=max_length,  # total length in tokens, prompt included
            # min_length=min_length,  # set the minimum length
            top_k=top_k,  # sample only from the top_k most likely tokens
            pad_token_id=tokenizer.eos_token_id,  # pad with the end-of-sequence token
        )
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text
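
The `top_k` argument passed to `model.generate` restricts sampling to the k highest-scoring next tokens. The following is a minimal sketch of that idea in plain Python, written independently of the Transformers implementation. `top_k_probs` and the example scores are hypothetical and only illustrate the filtering step.

```python
import math
from typing import List


def top_k_probs(logits: List[float], k: int) -> List[float]:
    """Softmax over the k highest-scoring entries; all other entries get probability 0."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest scores
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


probs = top_k_probs([2.0, 1.0, 0.5, -1.0], k=2)
print([round(p, 3) for p in probs])  # [0.731, 0.269, 0.0, 0.0]
```

A small `top_k` such as the 5 used in this notebook keeps the output close to the model's most likely continuations. Larger values allow more varied text.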




# Run preprocessing and fine-tuning

In [10]:
# preprocess data
sentences = preprocessing(data)
print("-----> Preprocessing done <------")
print(f" Total sentences in the dataset: {len(sentences)}\n")


# fine-tune GPT-2
print("Fine-tuning started")
fine_tune(sentences)
print("-----> Fine-tuning done <------")


-----> Preprocessing done <------
 Total sentences in the dataset: 24919

Fine-tuning started
------> epoch 0 complete
------> epoch 1 complete
------> epoch 2 complete
------> epoch 3 complete
------> epoch 4 complete
------> epoch 5 complete
------> epoch 6 complete
------> epoch 7 complete
------> epoch 8 complete
------> epoch 9 complete
------> epoch 10 complete
------> epoch 11 complete
------> epoch 12 complete
------> epoch 13 complete
------> epoch 14 complete
------> epoch 15 complete
------> epoch 16 complete
------> epoch 17 complete
------> epoch 18 complete
------> epoch 19 complete
------> epoch 20 complete
------> epoch 21 complete
------> epoch 22 complete
------> epoch 23 complete
------> epoch 24 complete
------> epoch 25 complete
------> epoch 26 complete
------> epoch 27 complete
------> epoch 28 complete
------> epoch 29 complete
------> epoch 30 complete
------> epoch 31 complete
------> epoch 32 complete
------> epoch 33 complete
------> epoch 34 complete
------

# Generate text

In [13]:
# generate new text
prefix = "The young knight"
max_length = 800
top_k = 5

generated_text = generate(prefix, max_length, top_k)

print("\nGenerated Text:\n")
print(generated_text)


Generated Text:

The young knight was about to be sent to the castle when he felt a sudden voice say, "Lord, come and see the king. He is not here, and he has come to tell you something."
"He has been sent to tell me nothing," answered the knight, "but it is not the king's business to tell me anything. You have seen him, and you will hear him."
"He does not come to tell me anything, then. The king will not hear him. He is a knight of the wall."
The knight glanced at the sky, the soft moon overcast, and the thick fog that hung over the castle.
The lord was not a knight of the wall, but a knight of the castle.
He was a knight of the wall, and this was not something he could say about his lord.
He did not rise to the bait of the lord's anger by asking about it.
Instead, he asked the king about it.
The knight had spent many days on the wall, and the lord had never seen him rise to the bait of a knight so easily.
It had been a long time since his first battle with the wall.
The knight had 

In [14]:
# show the generated text sentence by sentence
texts = nltk.sent_tokenize(generated_text)
for text in texts:
    print(text)

The young knight was about to be sent to the castle when he felt a sudden voice say, "Lord, come and see the king.
He is not here, and he has come to tell you something."
"He has been sent to tell me nothing," answered the knight, "but it is not the king's business to tell me anything.
You have seen him, and you will hear him."
"He does not come to tell me anything, then.
The king will not hear him.
He is a knight of the wall."
The knight glanced at the sky, the soft moon overcast, and the thick fog that hung over the castle.
The lord was not a knight of the wall, but a knight of the castle.
He was a knight of the wall, and this was not something he could say about his lord.
He did not rise to the bait of the lord's anger by asking about it.
Instead, he asked the king about it.
The knight had spent many days on the wall, and the lord had never seen him rise to the bait of a knight so easily.
It had been a long time since his first battle with the wall.
The knight had fought so hard, an