<h2>Full name : El ghazi Loubna </h2>
<h1>Part 2: Transformer Text Generation</h1>

***This code first loads a subset of the Wikipedia dataset, tokenizes the text data using the GPT-2 tokenizer, and fine-tunes the GPT-2 model on the tokenized dataset. After fine-tuning, it generates text based on a given prompt using the fine-tuned model.***
<br/>Make sure to install the following packages:<br/>
!pip install transformers<br/>
!pip install torch


The dataset comes from the Hugging Face Datasets library, so make sure to install it as well:<br/>
!pip install datasets


I followed this tutorial: https://gist.github.com/mf1024/3df214d2f17f3dcc56450ddf0d5a4cd7

In [3]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

import logging
logging.getLogger().setLevel(logging.CRITICAL)


device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

In [6]:
from datasets import load_dataset
import random
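# Load the first 1,000 articles of the English Wikipedia dump
# (the config name "20220301.en" includes the language code; omitting it raises an error)
wikipedia_dataset = load_dataset("wikipedia", "20220301.en", split="train[:1000]", trust_remote_code=True)
# Sample 100 articles at random (note: the cells below use the full 1,000-article subset, not this sample)
sampled_articles = random.sample(list(wikipedia_dataset["text"]), k=100)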


In [47]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; torch.optim.AdamW is the maintained replacement
from transformers import GPT2Config

# Initialize GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Preprocess the dataset (ensure all elements are strings)
text_dataset = [str(article) for article in wikipedia_dataset["text"]]

# Tokenize the dataset and truncate to the maximum sequence length.
# GPT-2's limit of 1024 is counted in tokens, so truncate the token IDs rather than the characters.
max_seq_length = 1024
tokenized_dataset = []
for article in text_dataset:
    tokens = tokenizer.encode(article, add_special_tokens=True, truncation=True, max_length=max_seq_length)
    tokenized_dataset.append(tokens)
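
GPT-2's context limit is measured in tokens, not characters, so cutting the raw string and cutting the token list give different results. The following is only a minimal sketch of that difference: `toy_tokenize` is a hypothetical whitespace splitter standing in for the real GPT-2 byte-pair tokenizer, and the function names are illustrative.

```python
# Toy illustration: a whitespace "tokenizer" stands in for GPT2Tokenizer
# (an assumption for the sketch, not the real byte-pair encoding).
def toy_tokenize(text):
    return text.split()

def truncate_by_chars(text, limit):
    # Cut the raw string to `limit` characters first, then tokenize
    return toy_tokenize(text[:limit])

def truncate_by_tokens(text, limit):
    # Tokenize first, then keep at most `limit` tokens
    return toy_tokenize(text)[:limit]

sample = "the quick brown fox jumps over the lazy dog"
char_cut = truncate_by_chars(sample, 10)
token_cut = truncate_by_tokens(sample, 10)
print(len(char_cut), len(token_cut))
```

With a limit of 10, the character cut keeps only the first two words, while the token cut keeps the whole nine-word sentence.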


In [48]:
# Define GPT-2 model configuration
model_config = GPT2Config.from_pretrained('gpt2')

# Instantiate GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2', config=model_config)

In [50]:
# Fine-tuning hyperparameters

learning_rate = 1e-5
epochs = 3
batch_size = 4

In [51]:
# Optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)

In [69]:
# Padding function for the DataLoader
# GPT-2 defines no pad token by default, so fall back to the end-of-text token
pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

def collate_fn(batch):
    max_len = max(len(seq) for seq in batch)
    # Pad every sequence (the longest one gets zero padding) so none is dropped from the batch
    padded_batch = [seq + [pad_token_id] * (max_len - len(seq)) for seq in batch]
    return torch.tensor(padded_batch, dtype=torch.long)

# DataLoader
train_loader = DataLoader(tokenized_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
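
Padding brings every sequence in a batch to the same length so that the batch can be stacked into one tensor, and an attention mask records which positions are real tokens. A minimal pure-Python sketch of the idea follows; `pad_batch` and the pad value `0` are hypothetical placeholders, not part of the notebook or of any library.

```python
def pad_batch(batch, pad_id):
    """Right-pad each sequence to the longest one and build a matching mask."""
    max_len = max(len(seq) for seq in batch)
    # Append pad_id until every sequence has max_len entries
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(mask)
```

Note that the longest sequence stays in the batch unchanged; only the shorter ones receive padding.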


In [None]:
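# Fine-tuning the model
# GPT-2 defines no pad token by default; the end-of-text token is assumed to be the padding value
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
model.train()
for epoch in range(epochs):
    total_loss = 0
    for batch in train_loader:
        input_ids = batch.to(model.device)  # keep the batch on the same device as the model
        attention_mask = (input_ids != pad_id).long()  # 0 on padding positions
        labels = input_ids.clone()
        labels[labels == pad_id] = -100  # Ignore the loss on padding tokens
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Report the mean loss per batch rather than the running sum
    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss / len(train_loader):.4f}")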
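
The label value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why positions set to it do not contribute to the training loss. The sketch below reproduces that averaging in plain Python under simplifying assumptions: `masked_cross_entropy` is a hypothetical helper, the logits are made-up numbers, and the one-position shift that the language-model head applies between inputs and labels is left out.

```python
import math

IGNORE_INDEX = -100  # same value as the default ignore_index in PyTorch's cross-entropy

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padding position: contributes nothing to the loss
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else float("nan")

logits = [[2.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
labels = [0, 1, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

Dropping the third position entirely gives the same value, which is the behaviour the `-100` labels rely on.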

Generate text:

In [82]:
def generate_text(model, tokenizer, prompt, max_length=100, temperature=1.0):
    # Encode the prompt and place it on the same device as the model
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

    # Attend to every prompt token, and reuse the EOS token as the pad token since GPT-2 defines none
    attention_mask = torch.ones_like(input_ids)
    model.config.pad_token_id = model.config.eos_token_id
    model.eval()  # switch off dropout, which is still active after model.train()
    output_ids = model.generate(input_ids, attention_mask=attention_mask, do_sample=True, max_length=max_length, temperature=temperature)

    # Decode and return the generated text
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return generated_text
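
With `do_sample=True`, the `temperature` argument rescales the logits before sampling: values below 1 sharpen the distribution, values above 1 flatten it. The following is a minimal sketch of that mechanism in plain Python, not the library's implementation; `softmax_with_temperature`, `sample_token`, and the example logits are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature, then apply a numerically stable softmax
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    # Draw one token index according to the temperature-scaled probabilities
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)    # more peaked on the top token
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)     # closer to uniform
print(sample_token(logits, temperature=0.5))
```

Lower temperatures tend to make the output more repetitive and predictable, while higher ones make it more varied and less coherent.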


In [113]:
# Example prompt:
prompt = "Depression is "

In [114]:
# Output:
# Generate text based on the prompt
generated_text = generate_text(model, tokenizer, prompt)
print("Generated Text:")
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
Depression is  saying, " That doesn't make me happy, but, that makes me happy?" "I'm happy  that I felt as good as any." "Well. what do I feel when I think about how I felt if I was happy with this?" "Well " "It's not the thing I feel for." " " that makes me happy." "It makes me even more happy." "I feel. I. I feel." "It make [says to myself] that feel?" "I have feelings, I feel."" "That makes me happy." He said, " You know." He said, " If your life is worth living, you don't have. You can


In [85]:

device 

'cuda'