# Training a Happy/Positive LLM with PPO

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/1337-Artificial-Intelligence/hackai-2025/blob/main/new_notebooks/alignment_ppo_alatlas_msac.ipynb)

Estimated time needed: **1** hour on a free T4 (Google Colab)


Imagine you're an AI engineer building an LLM that is super cheerful (a "Happy LLM").

You don't tell it exactly what to say. Instead, you let it **learn by trial and error**: this is **Reinforcement Learning (RL)**.  
The LLM acts (outputs text), a **reward model** scores it (positive/negative sentiment), and the LLM improves over time.

#### What is Reinforcement Learning (RL)?

Reinforcement Learning is a branch of machine learning where agents learn by interacting with an environment and receiving feedback in the form of rewards or penalties.  
Unlike supervised learning (labeled examples), RL relies on **exploration** and **learning from consequences**.

In this setup:  
- **Agent** = the LLM (Large Language Model)  
- **Environment** = the text generation task  
- **Action** = the generated text  
- **Reward** = score from a sentiment classifier

<img src='https://superagi.com/wp-content/uploads/2024/03/Untitled-2.png.webp' width='600'>


#### What is PPO?

**Proximal Policy Optimization (PPO)** is an RL algorithm created by OpenAI that allows stable, efficient policy updates.  
It keeps updates **gentle** (no big jumps) to avoid breaking the learning process.
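
To make "gentle" concrete: PPO limits each update by clipping the probability ratio between the new and the old policy. The sketch below is a minimal pure-Python illustration of the clipped surrogate objective for a single action. The function name `clipped_surrogate` and the default `clip_eps=0.2` are illustrative assumptions, not values taken from this lab's configuration.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one action (illustrative sketch).

    ratio     -- pi_new(action) / pi_old(action)
    advantage -- how much better the action was than expected
    clip_eps  -- clipping range (assumed value, not from this lab)
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + clip_eps) * advantage.
print(clipped_surrogate(1.5, 1.0))
# A small ratio with a negative advantage is capped at (1 - clip_eps) * advantage.
print(clipped_surrogate(0.5, -1.0))
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective stops growing, so a single batch cannot pull the policy far from where it started.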

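#### How Does the Reward Model Work?

You use a **sentiment classifier** (in this lab, a multilingual model fine-tuned on the AfriSenti Twitter sentiment dataset) to score generated text:  
- Positive text → big reward for Happy LLM!  
- Negative or neutral text → small (or negative) reward.  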

In other words, the classifier **judges** the LLM outputs and converts sentiment into a **numerical reward**.



#### PPO Training Steps

1. **Collect Rollouts:**  
   Let the model generate text, record states, actions, rewards.

2. **Compute Advantages:**  
   How much better was an action compared to expected?

3. **Policy Update:**  
   Use loss to gently improve policy.

4. **Value Update:**  
   Improve the model's predictions of expected rewards.

5. **Entropy Regularization:**  
   Encourage exploration by rewarding randomness.

6. **Repeat:**  
   Across mini-batches and epochs.
   
<img src='https://superagi.com/wp-content/uploads/2024/03/Untitled-3.png.webp' width='600'>
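
Step 2 can be sketched in a few lines. The example below computes advantages with Generalized Advantage Estimation (GAE), a common choice in PPO implementations. Whether it matches TRL's internals exactly is an assumption, and `gamma`, `lam` and the toy numbers are illustrative only.

```python
def compute_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one episode (illustrative sketch).

    rewards -- reward received at each step
    values  -- value-head prediction at each step; the value after the last step is taken as 0
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD error: how much better this step went than the value head predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy episode: the reward arrives only at the last token, as with a sentiment score.
print(compute_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], lam=1.0))
```

A positive advantage means the generated tokens did better than the value head expected, so the policy update in step 3 makes them more likely.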


#### In This Lab

You will fine-tune Al-Atlas-0.5B to generate **positive text** using PPO, following the Hugging Face TRL example.


### Setup experiment

- Install dependencies

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
%pip install --quiet transformers datasets trl==0.11 wandb

Note: you may need to restart the kernel to use updated packages.


- Packages

In [None]:
import os
import torch
from tqdm import tqdm
import pandas as pd

tqdm.pandas()

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler



- Configuration

In [None]:
MODEL = "atlasia/Al-Atlas-0.5B" # Model to finetune and also its own reference and tokenizer
DATASET_NAME = "AbderrahmanSkiredj1/MSAC_darija_sentiment_analysis" # Dataset to finetune on
REWARD_MODEL = "Davlan/afrisenti-twitter-sentiment-afroxlmr-large" # Reward model to use

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu" # use the GPU if one is available
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32 # bfloat16 on GPU, float32 on CPU

# Set the Hugging Face token (replace the placeholder with your own token)
os.environ["HF_TOKEN"] = "YOUR_API_KEY"

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

### Load data and models

- Load the pre-trained [Al-Atlas 0.5B model](https://huggingface.co/atlasia/Al-Atlas-0.5B)

**Al-Atlas** is a 0.5B parameter language model specifically trained on **Moroccan Darija**, making it the first dedicated foundation model for Morocco's primary spoken dialect. The model was finetuned from **Qwen-2.5** and trained on a carefully curated dataset of **155M tokens**, focusing exclusively on authentic Moroccan Darija content.

We load the model with a value head and the tokenizer. 
We load the model twice; the first model is optimized while the second model serves as **a reference** to calculate the KL-divergence from the starting point. This serves as an additional reward signal in the PPO training to make sure the optimized model does not deviate too much from the original language model.
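
The role of the reference model can be sketched as follows: each generated token is penalized in proportion to how far the trained model's log-probability drifts from the reference model's, and the classifier score is added on the last token. This is a simplified illustration. The function name `kl_penalized_rewards`, the `kl_coef` value and the toy numbers are assumptions, not TRL's exact implementation.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, kl_coef=0.2):
    """Per-token rewards: a KL penalty on every token, plus the classifier score on the last one.

    logprobs     -- log-probabilities of the generated tokens under the model being trained
    ref_logprobs -- log-probabilities of the same tokens under the frozen reference model
    score        -- scalar reward from the sentiment classifier
    kl_coef      -- penalty strength (illustrative value)
    """
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


# The first token drifted from the reference (penalized); the second did not.
print(kl_penalized_rewards([-1.0, -2.0], [-1.5, -2.0], score=3.0))
```

Without this penalty, the model could maximize the sentiment score by drifting into repetitive or ungrammatical text that the classifier happens to like.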

In [None]:
# Model/Reference Model
model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL, torch_dtype=dtype)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL, torch_dtype=dtype)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token



- Load the pre-trained reward model afrisenti-twitter-sentiment-afroxlmr-large

afrisenti-twitter-sentiment-afroxlmr-large is a multilingual Twitter sentiment classification model covering twelve African languages, including Moroccan Darija.
It classifies tweets into 3 sentiment classes: negative, neutral and positive. Specifically, it is a Davlan/afro-xlmr-large model that was fine-tuned on an aggregation of 12 African language datasets from the AfriSenti dataset.

In [None]:
# Load the reward model in a sentiment-analysis pipeline.
# These settings run the pipeline in batches of 16, return scores for all sentiment classes (top_k=None),
# and skip the softmax (function_to_apply="none"), so the scores are raw logits.
sent_kwargs = {"top_k": None, "function_to_apply": "none", "batch_size": 16}
sentiment_pipe = pipeline(
    "sentiment-analysis", model=REWARD_MODEL, device=device, torch_dtype=dtype,
    **sent_kwargs
)
print("classes labels: ",sentiment_pipe.model.config.id2label)

classes labels:  {0: 'positive', 1: 'neutral', 2: 'negative'}


### Load [MSAC](https://huggingface.co/datasets/AbderrahmanSkiredj1/MSAC_darija_sentiment_analysis) dataset

The Moroccan Sentiment Analysis Corpus (MSAC) is a dataset of 2,000 tweets written in Moroccan Arabic (Darija), collected from Twitter. Each entry is annotated with a sentiment label (pos for positive, neg for negative, neu for neutral), making it suitable for training and evaluating sentiment analysis models tailored to the linguistic characteristics of Moroccan Arabic.

- Dataset

In [None]:
def map_labels(sample):
    """Map the label to 1 for 'pos' and to 0 for anything else."""
    label = sample["label"]
    sample["label"] = 1 if label == "pos" else 0
    return sample


def build_dataset(
    dataset_name=DATASET_NAME,
    input_min_text_length=4,
    input_max_text_length=12,
    tokenizer = tokenizer
):
    """
    Build dataset for training. This builds the dataset from `load_dataset`, one should
    customize this function to train the model on its own dataset.
    """
    # load the MSAC dataset with datasets
    ds = load_dataset(dataset_name, split="train")
    ds = ds.map(map_labels)
    ds = ds.rename_columns({"text": "review"})
    ds = ds.shuffle(seed=42)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds

Using a ```LengthSampler``` to draw a random prompt length for each sample (between `input_min_text_length` and `input_max_text_length` tokens) means the model sees prompts of varying lengths during training rather than a single fixed length. Truncating each tweet to a few tokens also keeps generation and the PPO steps cheap, since the model only has to continue a short prompt.
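
The idea can be sketched without TRL. `SimpleLengthSampler` below is a hypothetical stand-in written for illustration. It assumes the upper bound is exclusive (like Python's `range`), which is worth checking against the `trl.core.LengthSampler` source for your installed version.

```python
import random


class SimpleLengthSampler:
    """Hypothetical stand-in for trl.core.LengthSampler, for illustration only."""

    def __init__(self, min_value, max_value):
        # Assumption: the upper bound is exclusive, like Python's range().
        self.values = list(range(min_value, max_value))

    def __call__(self):
        return random.choice(self.values)


sampler = SimpleLengthSampler(4, 12)
token_ids = list(range(100, 130))  # stand-in for tokenizer.encode(...)
for _ in range(3):
    prompt_ids = token_ids[: sampler()]  # truncate to a random prompt length
    print(len(prompt_ids))
```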

In [None]:
# build the dataset
dataset = build_dataset()

Generating train split: 100%|██████████| 2000/2000 [00:00<00:00, 19596.94 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 18487.29 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 3272.49 examples/s]


- Collator

The collator function prepares data batches in the format the PPOTrainer expects: it turns a list of per-sample dictionaries into a single dictionary that maps each feature name to the list of its values.


In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
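
To see what the collator does, here it is applied to two toy samples (the sample values are made up for illustration):

```python
def collator(data):
    # Group each feature across samples: list of dicts -> dict of lists.
    return dict((key, [d[key] for d in data]) for key in data[0])


samples = [
    {"input_ids": [1, 2, 3], "query": "first prompt"},
    {"input_ids": [4, 5], "query": "second prompt"},
]
batch = collator(samples)
print(batch["input_ids"])
print(batch["query"])
```

Note that the sequences are not padded to a common length; each feature simply becomes a list with one entry per sample.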

##### Test the reward model performance

In [None]:
# negative text
text = "طابعان راه مكتاءب!"
sentiment_pipe(text)

[[{'label': 'negative', 'score': 2.515625},
  {'label': 'neutral', 'score': 0.2392578125},
  {'label': 'positive', 'score': -3.171875}]]

In [None]:
# positive text
text = "طابعان راه فرحان!"
sentiment_pipe(text)

[[{'label': 'positive', 'score': 2.953125},
  {'label': 'negative', 'score': -1.0234375},
  {'label': 'neutral', 'score': -1.6875}]]
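
The pipeline returns one list of label/score dictionaries per input text, and the order of the labels varies from one output to the next. A small helper that looks up the score by label name is therefore safer than indexing by position. The sketch below uses the two outputs shown above as literal data, so it runs without loading the model; the helper name `positive_scores` is an illustrative choice.

```python
def positive_scores(pipe_outputs):
    """Return the raw 'positive' logit for each input text."""
    return [
        item["score"]
        for output in pipe_outputs
        for item in output
        if item["label"] == "positive"
    ]


# Pipeline outputs copied from the two cells above.
sad_output = [[{"label": "negative", "score": 2.515625},
               {"label": "neutral", "score": 0.2392578125},
               {"label": "positive", "score": -3.171875}]]
happy_output = [[{"label": "positive", "score": 2.953125},
                 {"label": "negative", "score": -1.0234375},
                 {"label": "neutral", "score": -1.6875}]]

print(positive_scores(sad_output + happy_output))
```

The sad sentence gets a strongly negative "positive" logit and the happy one a strongly positive logit, which is the signal PPO will maximize.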

### Initialize PPOTrainer
The `PPOTrainer` takes care of device placement and optimization later on:

- ```config```: Configuration settings for PPO training, such as learning rate and model name
- ```model```: The primary model to be fine-tuned using PPO
- ```ref_model```: The frozen reference model that ```model``` is compared against
- ```tokenizer```: Tokenizer corresponding to the model, used for processing input text
- ```dataset```: Dataset to be used for training, providing the input data for the model
- ```data_collator```: Data collator to handle batching and formatting of the input data


In [None]:
config = PPOConfig(
    model_name=MODEL, # the model name to be trained
    learning_rate=1.41e-5, # the learning rate for the optimizer
    log_with="wandb",   # the logging method to be used
    batch_size=32,  # the batch size for training
    mini_batch_size=32,    # the mini batch size for PPO

)

In [None]:
ppo_trainer = PPOTrainer(
    config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator
)

[34m[1mwandb[0m: Currently logged in as: [33mafaf[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


### Generation settings
```generation_kwargs``` defines the parameters used when calling the language model for text generation. The configuration below produces fully sampled, unconstrained output: top-k and top-p filtering are disabled (`top_k` = 0, `top_p` = 1.0), so tokens are drawn from the model's full distribution. This gives maximum diversity, which suits creative generation, but can produce less coherent or less controlled results. See the [text generation documentation](https://huggingface.co/docs/transformers/main_classes/text_generation).

In [None]:
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
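
To illustrate what `top_p` controls, here is a simplified pure-Python sketch of nucleus filtering on a toy next-token distribution. It is not the transformers implementation; the function name `top_p_filter` and the probabilities are made up for the example.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept probabilities sum to 1 again.
    return {token_id: prob / total for token_id, prob in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution over 4 tokens
print(sorted(top_p_filter(probs, top_p=0.9)))  # token ids kept with a restrictive setting
print(sorted(top_p_filter(probs, top_p=1.0)))  # top_p=1.0 keeps every token
```

With `top_p` = 1.0, as in the settings above, nothing is filtered out, so even unlikely tokens can be sampled.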

### Optimize model

### Training loop

The training loop consists of the following main steps:
1. Get the query responses from the policy network (Al-Atlas-0.5B)
2. Get sentiments for query/responses from afrisenti-twitter-sentiment-afroxlmr-large
3. Optimize policy with PPO using the (query, response, reward) triplet

**Training time**

This step takes **~20 minutes** on an RTX 3070-class GPU with the settings specified above.

In [None]:
output_min_length = 4
output_max_length = 16
# sample a random response length for each query, as was done for the prompt length
output_length_sampler = LengthSampler(output_min_length, output_max_length)


for epoch, batch in enumerate(tqdm(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    #### Get response from the policy model
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        query_response = ppo_trainer.generate(query, **generation_kwargs).squeeze().to(device)
        response_len = len(query_response) - len(query)
        response_tensors.append(query_response[-response_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute sentiment score
    pipe_outputs = sentiment_pipe(batch["response"])
    positive_scores = [
        item["score"]
        for output in pipe_outputs
        for item in output
        if item["label"] == "positive"
    ]
    rewards = [torch.tensor(score) for score in positive_scores]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

  0%|          | 0/62 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
 13%|█▎        | 8/62 [02:09<14:38, 16.26s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 62/62 [13:45<00:00, 13.31s/it]
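
One detail in the loop above deserves a note: the response is cut out with negative indexing (`query_response[-response_len:]`). If generation ever returned no new tokens, `response_len` would be 0 and `[-0:]` would return the whole sequence, prompt included. The sketch below uses plain Python lists as stand-ins for the tensors to show the difference. Slicing from `len(query)` is offered as a possible defensive alternative, not as something TRL requires.

```python
query = [11, 12, 13]                 # stand-in for the prompt token ids
query_response = [11, 12, 13]        # generation that added no new tokens

response_len = len(query_response) - len(query)   # 0
print(query_response[-response_len:])             # whole sequence: [-0:] is the same as [0:]
print(query_response[len(query):])                # empty response, as intended

query_response = [11, 12, 13, 21, 22]             # normal case: two new tokens
response_len = len(query_response) - len(query)
print(query_response[-response_len:] == query_response[len(query):])
```

In the normal case both slices agree, so the loop behaves as intended whenever at least one token is generated.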


## Model inspection
Let's inspect some examples from the MSAC dataset. We can use `ref_model` to compare the tuned model `model` against the model before optimisation.

In [None]:
#### get a batch from the dataset
bs = 20

output_min_length = 10
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}


game_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
game_data["query"] = df_batch["query"].tolist()
game_data["label"] = df_batch["label"].tolist()

game_data["review"] = df_batch["review"].tolist()
query_tensors = df_batch["input_ids"].tolist()

response_tensors_ref, response_tensors = [], []

#### get responses from the reference model and the tuned model
for i in range(bs):
    query = torch.tensor(query_tensors[i]).to(device)

    gen_len = output_length_sampler()
    query_response = ref_model.generate(
        query.unsqueeze(0), max_new_tokens=gen_len, **generation_kwargs
    ).squeeze()
    response_len = len(query_response) - len(query)
    response_tensors_ref.append(query_response[-response_len:])

    query_response = model.generate(
        query.unsqueeze(0), max_new_tokens=gen_len, **generation_kwargs
    ).squeeze()
    response_len = len(query_response) - len(query)
    response_tensors.append(query_response[-response_len:])

#### decode responses
game_data["response (before)"] = [
    tokenizer.decode(response_tensors_ref[i]) for i in range(bs)
]
game_data["response (after)"] = [
    tokenizer.decode(response_tensors[i]) for i in range(bs)
]

#### sentiment analysis of query/response pairs before/after
texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
pipe_outputs = sentiment_pipe(texts)
positive_scores = [
    item["score"]
    for output in pipe_outputs
    for item in output
    if item["label"] == "positive"
]
game_data["rewards (before)"] = positive_scores

texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
pipe_outputs = sentiment_pipe(texts)
positive_scores = [
    item["score"]
    for output in pipe_outputs
    for item in output
    if item["label"] == "positive"
]
game_data["rewards (after)"] = positive_scores

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

Unnamed: 0,query,label,review,response (before),response (after),rewards (before),rewards (after)
0,'أغانيك وخامة صوتك,1,'أغانيك وخامة صوتك رائعة ما شاء الله عليك',اليوم ما غاديش تهناو' تحركات الراقصة منال ب,' هي أغنية جميلة وعجباتكم؟ 👇,4.75,5.6875
1,'و من عمق القلب اتمن,1,'و من عمق القلب اتمنالك التوفیق الدائم في الحی...,ى تكون المشة مناضلة!' 'الغزالة ديال السمية' هاد,ى' 😍😍😍😍😍😍❤❤,4.75,5.125
2,'فين هما المفسدين لي,0,'فين هما المفسدين لي قال ليك غادي يحاربهم ؟',كيظن الكل أنهم شرفاء ؟' \n\n#Sliwka #,معمر الشوارع' هي أغنية جميلة وإله,-1.523438,2.6875
3,'. أنا شخصيا قاط,0,'. أنا شخصيا قاطعت القناة من زمان ولم يعد أحد ...,عتو و قلت مازال ماجاش (حيت كنت واصل في ام,ع العنف كامل 😝👍👍👍👍✌️✌️,-2.03125,-0.777344
4,'زبن مص,0,'زبن مصطنع غرور ونخوة على لخوة واكواك',ري واحد كايعبر بيه على الوقت.' واش هاد الجملة كتع,ري' هي قصة جميلة وعقوبة م,3.515625,-0.882812
5,' جميع المنتجعات المغ,0,' جميع المنتجعات المغربية تعاني من نفس المشكل . ',ربية بعدا صحاو فقرارهم التواريخ، وقرر الرباطن ...,ربية' هي منصة مانشيال بلا مشاكل ن,-1.554688,3.515625
6,'كعادتها الدولة لاتحقق في,0,'كعادتها الدولة لاتحقق في الشكايات حتى تسقط ال...,قضية فقط' من قصص تاريخ لي دخلات وسطي فالتقدي,موتنا الحقيقي فنيسيان اصبح,-0.089355,-0.255859
7,'لدواعي ا,0,'لدواعي امنية.تعني تفادي همجية بعض العقول الضع...,نتقامية من السلطة السلطانية هي الأمور التي تبد...,خوية' 😍😍❤️❤️❤️❤️,-1.164062,4.5
8,'إقطع العلاقة مع الجز,0,'إقطع العلاقة مع الجزائر يا بن كيران',يرة'، اللي كتمثل قيمة العقد، كتمثل التفاوض الم...,يرة' هو أغنية جميلة ومسر,-1.523438,3.296875
9,'فيما قريب سن,0,'فيما قريب سنجد أنفسنا في وضع كوضع الدول الثنا...,وقف التفاح بالدرب او زليفي كازابلانكا نمسح بلا,دي' هي أغنية جميلة وشخصية مثالية,2.6875,5.5


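Looking at the mean and median reward of the generated sequences, we observe a clear increase after PPO training: in the run shown below, the mean rises from about 0.95 to 3.21 and the median from about -0.15 to 3.76.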

In [None]:
print("mean:")
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print("median:")
display(df_results[["rewards (before)", "rewards (after)"]].median())

mean:


rewards (before)    0.950903
rewards (after)     3.208691
dtype: float64


median:


rewards (before)   -0.154053
rewards (after)     3.757812
dtype: float64
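
Means and medians hide per-example behaviour, so it can also help to count how many individual prompts improved. The sketch below does this for the 10 rows displayed in the results table above (the rewards are copied from that output; the mean and median above are computed over the full sampled batch, so the numbers are not directly comparable).

```python
# Rewards copied from the 10 rows displayed in the table above.
rewards_before = [4.75, 4.75, -1.523438, -2.03125, 3.515625,
                  -1.554688, -0.089355, -1.164062, -1.523438, 2.6875]
rewards_after = [5.6875, 5.125, 2.6875, -0.777344, -0.882812,
                 3.515625, -0.255859, 4.5, 3.296875, 5.5]

# Count the rows whose reward went up after PPO training.
improved = sum(after > before for before, after in zip(rewards_before, rewards_after))
print(f"{improved} of {len(rewards_before)} displayed rows improved")
```

Most displayed rows improve, but not all: rows 4 and 6 score lower after training, a reminder that sampling is random and PPO shifts the distribution rather than every single output.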

In [None]:
generation_kwargs = {
    "min_length": 10,                  # Minimum total sequence length in tokens (prompt + generated text)
    "max_length": 20,                # Maximum total sequence length in tokens, to avoid endless outputs
    "top_k": 50,                      # Limits sampling to the 50 most likely tokens
    "top_p": 0.95,                    # Nucleus sampling: keeps the smallest set of top tokens whose cumulative prob ≥ 0.95
    "do_sample": True,               # Enables sampling (needed when using top_k/top_p)
    "temperature": 0.8,              # Controls randomness; <1 = more deterministic, >1 = more random
    "pad_token_id": tokenizer.eos_token_id,  # Uses the EOS token for padding
}


In [None]:
text="أنا"
text_tokenized = tokenizer.encode(text,return_tensors="pt").to(device)
tokenizer.decode(model.generate(text_tokenized, **generation_kwargs).squeeze())

'أنا مغربي ومسلم حمد الله ❤️❤️❤️❤️❤️'

In [None]:
text_tokenized = tokenizer.encode(text,return_tensors="pt").to(device)
tokenizer.decode(model.generate(text_tokenized, **generation_kwargs).squeeze())

'أنا كنحاول نفهم أكثر على الجينات اللي ممكن تسبب السرط'

### Save model
Finally, we save the model and push it to the Hugging Face Hub for later use.

In [None]:
NEW_MODEL_NAME = "Al-Atlas-0.5B-ppo-positive-thinking" # the model is available at HackAI-2025/Al-Atlas-0.5B-ppo-positive-thinking for inference purposes
model.push_to_hub(NEW_MODEL_NAME)
tokenizer.push_to_hub(NEW_MODEL_NAME)

model.safetensors: 100%|██████████| 988M/988M [00:48<00:00, 20.5MB/s] 


### Exercise Time: Can You Spot the Positivity Bias?

🧠 Objective:
You’ve fine-tuned an LLM using PPO with a reward model that favors positive and happy responses. In this exercise, your task is to probe the model’s bias using a fixed evaluation set of neutral prompts.




📝Instructions:

The fine-tuned LLM has been trained to prefer positive, cheerful, and helpful responses.

Your mission:

- Evaluate the model on a fixed list of neutral prompts.

- Score each response using the provided reward model.

- Submit your average positivity score to the leaderboard along with the notebook (please keep the cell outputs).

⚠️ No model training is required—just analyze and run the evaluation!

🏆 The leaderboard ranks submissions by average positivity score.

In [None]:
test_prompts = [
    "كيف داير الجو اليوم؟",                      # How's the weather today?
    "شنو رأيك فالاثنين مع الصباح؟",             # What do you think about Monday mornings?
    "شرح ليا شنو هي قاعدة البيانات.",           # Explain what a database is.
    "شنو الدور ديال المعلم فالمدرسة؟",           # What is the role of a teacher?
    "كيفاش كتكون خدمة ديال المكتب؟",            # What is a typical office job like?
    "كيفاش تقضي وقتك الفراغ؟",                   # How do you spend your free time?
    "شنو كيدير الإنسان ملي كيتزوج؟",         # What does someone do when they get married?
    "كيفاش تحب تقضي عطلتك؟",                     # How do you like to spend your holidays?
    "شنو كيدير الإنسان ملي كيتقاعد؟",         # What does someone do when they retire?
    "وصف ليا نهار ديال الشتاء."                 # Describe a rainy day.
]
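
One possible shape for the evaluation is sketched below. `average_positivity`, `generate_fn` and `score_fn` are hypothetical names introduced for illustration: in the notebook, `generate_fn` would wrap `model.generate` plus `tokenizer.decode`, and `score_fn` would pick the "positive" score out of the `sentiment_pipe` output. Toy stand-ins are used here so the sketch runs without loading any model.

```python
def average_positivity(prompts, generate_fn, score_fn):
    """Average positive-sentiment score of the responses to a list of prompts.

    generate_fn -- hypothetical callable: prompt -> generated text
    score_fn    -- hypothetical callable: text -> positive score (float)
    """
    scores = [score_fn(generate_fn(prompt)) for prompt in prompts]
    return sum(scores) / len(scores)


# Toy stand-ins so the sketch runs without loading any model.
def fake_generate(prompt):
    return prompt + " :)"


def fake_score(text):
    return 1.0 if text.endswith(":)") else -1.0


print(average_positivity(["prompt a", "prompt b"], fake_generate, fake_score))
```

Because generation is sampled, the average will vary from run to run; fixing a random seed before evaluating makes the score easier to compare.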

## Setup
First, let's install the required packages:

In [None]:
!pip install --quiet transformers trl==0.11 wandb

## Import Libraries
We'll use these libraries to:
- `transformers`: Load and work with language models
- `trl`: Train models with reinforcement learning
- `torch`: Deep learning framework
- `datasets`: Handle our training data

In [None]:
import os
import torch
from tqdm import tqdm
import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

## Load Models
We'll use:
- Al-Atlas: A Moroccan Darija language model
- A sentiment classifier to score responses

In [None]:
# Model configuration
MODEL = "atlasia/Al-Atlas-0.5B"  # Our base model
DATASET_NAME = "AbderrahmanSkiredj1/MSAC_darija_sentiment_analysis"  # Training data
REWARD_MODEL = "Davlan/afrisenti-twitter-sentiment-afroxlmr-large"  # For scoring responses

# Setup device and data type
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

# Load models
model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL, torch_dtype=dtype)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL, torch_dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

# Load sentiment classifier
sent_kwargs = {"top_k": None, "function_to_apply": "none", "batch_size": 16}
sentiment_pipe = pipeline(
    "sentiment-analysis", 
    model=REWARD_MODEL, 
    device=device,
    torch_dtype=dtype,
    **sent_kwargs
)
print("Sentiment classes:", sentiment_pipe.model.config.id2label)

## Prepare Training Data
We'll use the Moroccan Sentiment Analysis Corpus (MSAC) dataset, which contains tweets in Moroccan Darija with sentiment labels.

In [None]:
def build_dataset(
    dataset_name=DATASET_NAME,
    input_min_text_length=4,
    input_max_text_length=12,
    tokenizer=tokenizer
):
    """Prepare dataset for training"""
    ds = load_dataset(dataset_name, split="train")
    ds = ds.map(lambda x: {"label": 1 if x["label"] == "pos" else 0})
    ds = ds.rename_columns({"text": "review"})
    ds = ds.shuffle(seed=42)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[:input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds

# Build dataset
dataset = build_dataset()

## Initialize PPO Trainer
This will handle our reinforcement learning training:

In [None]:
config = PPOConfig(
    model_name=MODEL,
    learning_rate=1.41e-5,
    log_with="wandb",
    batch_size=32,
    mini_batch_size=32,
)

ppo_trainer = PPOTrainer(
    config, 
    model, 
    ref_model, 
    tokenizer, 
    dataset=dataset,
    data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)

## Training Loop
Now we'll train our model to generate more positive responses. This will take about 20 minutes.

In [None]:
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

for epoch, batch in enumerate(tqdm(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # Generate responses
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        query_response = ppo_trainer.generate(query, **generation_kwargs).squeeze().to(device)
        response_len = len(query_response) - len(query)
        response_tensors.append(query_response[-response_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    # Score responses
    pipe_outputs = sentiment_pipe(batch["response"])
    positive_scores = [
        item["score"]
        for output in pipe_outputs
        for item in output
        if item["label"] == "positive"
    ]
    rewards = [torch.tensor(score) for score in positive_scores]

    # Update model
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

## Evaluate Results
Let's generate responses from the trained model on a few neutral prompts and score their sentiment:

In [None]:
# Test the model
test_prompts = [
    "كيف داير الجو اليوم؟",                      # How's the weather today?
    "شنو رأيك فالاثنين مع الصباح؟",             # What do you think about Monday mornings?
    "شرح ليا شنو هي قاعدة البيانات.",           # Explain what a database is.
    "شنو الدور ديال المعلم فالمدرسة؟",           # What is the role of a teacher?
    "كيفاش كتكون خدمة ديال المكتب؟",            # What is a typical office job like?
]

generation_kwargs = {
    "min_length": 10,
    "max_length": 20,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True,
    "temperature": 0.8,
    "pad_token_id": tokenizer.eos_token_id,
}

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    text_tokenized = tokenizer.encode(prompt, return_tensors="pt").to(device)
    response = tokenizer.decode(model.generate(text_tokenized, **generation_kwargs).squeeze())
    print(f"Response: {response}")
    
    # Get sentiment score
    sentiment = sentiment_pipe(response)
    print(f"Sentiment: {sentiment}")

## Exercise: Can You Spot the Positivity Bias?
🎯 **Your Task:**
1. Try different prompts in Moroccan Darija
2. Compare the responses with the original model
3. Notice how the trained model tends to be more positive

💡 **Tips:**
- Try neutral topics
- Ask about everyday situations
- Compare the emotional tone of responses

🏆 **Challenge:**
Can you find a prompt where the model's positivity might be inappropriate or excessive?

## Next Steps
- Try different reward models
- Experiment with different training parameters
- Explore other alignment techniques

Remember: The goal is to make AI helpful and positive, but not at the expense of accuracy or appropriateness!