
# Dataset exploration and preparation for RL

Dataset: [nvidia/HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)

### Exploration

In [1]:
%%capture
!pip install datasets

In [4]:
from datasets import load_dataset
from src import config

# config.DATASET_NAME is expected to resolve to "nvidia/HelpSteer2" (linked above)
ds = load_dataset(config.DATASET_NAME)
ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'helpfulness', 'correctness', 'coherence', 'complexity', 'verbosity'],
        num_rows: 20324
    })
    validation: Dataset({
        features: ['prompt', 'response', 'helpfulness', 'correctness', 'coherence', 'complexity', 'verbosity'],
        num_rows: 1038
    })
})

In [5]:
ds["train"][128]

{'prompt': 'Create a product requirement for a complete self-directed online exchange process  for theater tickets using user stories',
 'response': 'As a user, I want to be able to browse and search for available theater tickets online, so that I can easily find and purchase the tickets I need for my desired show.\n\nAs a user, I want to be able to select my preferred seat(s) for the show, so that I can ensure that I have a good view of the stage.\n\nAs a user, I want to be able to complete the payment process securely and quickly, so that I can complete my purchase in a timely manner.\n\nAs a user, I want to be able to receive my tickets electronically, so that I can access them easily and avoid the hassle of carrying physical tickets.\n\nAs a user, I want to be able to manage my bookings and access my tickets easily, so that I can have a seamless experience from start to finish.\n\nAs a user, I want to be able to contact customer support if I have any questions or issues, so that I 

In [6]:
import pandas as pd

# Convert each split to a DataFrame for summary statistics
train_df = ds['train'].to_pandas()
validation_df = ds['validation'].to_pandas()

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20324 entries, 0 to 20323
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   prompt       20324 non-null  object
 1   response     20324 non-null  object
 2   helpfulness  20324 non-null  int64 
 3   correctness  20324 non-null  int64 
 4   coherence    20324 non-null  int64 
 5   complexity   20324 non-null  int64 
 6   verbosity    20324 non-null  int64 
dtypes: int64(5), object(2)
memory usage: 1.1+ MB


In [8]:
validation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1038 entries, 0 to 1037
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   prompt       1038 non-null   object
 1   response     1038 non-null   object
 2   helpfulness  1038 non-null   int64 
 3   correctness  1038 non-null   int64 
 4   coherence    1038 non-null   int64 
 5   complexity   1038 non-null   int64 
 6   verbosity    1038 non-null   int64 
dtypes: int64(5), object(2)
memory usage: 56.9+ KB


In [9]:
train_df.describe()

| | helpfulness | correctness | coherence | complexity | verbosity |
|---|---|---|---|---|---|
| count | 20324.0 | 20324.0 | 20324.0 | 20324.0 | 20324.0 |
| mean | 2.864052 | 2.962655 | 3.638998 | 1.706505 | 2.002608 |
| std | 1.271479 | 1.270885 | 0.648175 | 0.697536 | 0.755464 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 2.0 | 3.0 | 1.0 | 2.0 |
| 50% | 3.0 | 3.0 | 4.0 | 2.0 | 2.0 |
| 75% | 4.0 | 4.0 | 4.0 | 2.0 | 2.0 |
| max | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |


In [10]:
validation_df.describe()

| | helpfulness | correctness | coherence | complexity | verbosity |
|---|---|---|---|---|---|
| count | 1038.0 | 1038.0 | 1038.0 | 1038.0 | 1038.0 |
| mean | 2.893064 | 2.999037 | 3.644509 | 1.672447 | 1.947013 |
| std | 1.248772 | 1.230831 | 0.667342 | 0.718023 | 0.786878 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 2.0 | 3.0 | 1.0 | 2.0 |
| 50% | 3.0 | 3.0 | 4.0 | 2.0 | 2.0 |
| 75% | 4.0 | 4.0 | 4.0 | 2.0 | 2.0 |
| max | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |


### Preparation

In [None]:
from data import preprocess_helpsteer

rl_dataset = preprocess_helpsteer.load_and_prepare_rl_dataset()

In [16]:
print("RL dataset structure:")
print(rl_dataset)
print("\nExample RL prompt (train):")
print(rl_dataset['train'][6])
print("\nExample RL prompt (test):")
print(rl_dataset['test'][6])

RL dataset structure:
DatasetDict({
    train: Dataset({
        features: ['query'],
        num_rows: 20324
    })
    test: Dataset({
        features: ['query'],
        num_rows: 1038
    })
})

Example RL prompt (train):
{'query': 'Define Signal Discuss its various properties with the help of diagram'}

Example RL prompt (test):
{'query': 'some people appear with excessive fashion or overly exposed body to show off. when i hang with someone like this, what kind of joke can i make?'}
