# Overview

In this notebook, we extend the existing dataset so that it supports multiple input types. We rename the `input` column to `structured_input`, then add two new columns, `direct_input` and `conversational_input`.

# Prompt Types

1. **Direct Prompt**
   - **Description:** A straightforward question with no additional context.
   - **Example:** "What is 5 + 3?"

2. **Structured Prompt**
   - **Description:** A prompt with a clearly defined structure, using labeled components to specify the operation and operands.
   - **Example:** `{"A": 5, "op": "+", "B": 3}`

3. **Conversational Prompt**
   - **Description:** A question embedded within a natural, conversational context.
   - **Example:** "If I have 5 and add 3, what is the total?"
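
To make the three formats concrete, the sketch below renders one calculation in each style. The `A`/`op`/`B` field names follow the dataset rows shown later in this notebook. The helper `render_prompt_types` and the fixed wording templates are illustrative assumptions only: the notebook itself uses a language model, not templates, to produce the direct and conversational variants.

```python
import json

# Hypothetical fixed templates; the notebook generates these phrasings with an LLM.
DIRECT_TEMPLATE = "What is {A} {op} {B}?"
CONVERSATIONAL_TEMPLATES = {
    "+": "If I have {A} and add {B}, what is the total?",
    "-": "If I have {A} and take away {B}, how much is left?",
    "*": "If I have {A} groups of {B}, how many is that altogether?",
    "/": "If I split {A} into {B} equal parts, how big is each part?",
}


def render_prompt_types(a, op, b):
    """Return the same calculation as structured, direct, and conversational text."""
    return {
        "structured": json.dumps({"A": a, "op": op, "B": b}),
        "direct": DIRECT_TEMPLATE.format(A=a, op=op, B=b),
        "conversational": CONVERSATIONAL_TEMPLATES[op].format(A=a, B=b),
    }


example = render_prompt_types(5, "+", 3)
for kind, text in example.items():
    print(f"{kind}: {text}")
```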

# Load dataset

In [1]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

In [2]:
from datasets import load_dataset

ds = load_dataset("micost/simple_calculation")
ds

README.md:   0%|          | 0.00/346 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/80.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 5000
    })
})

In [3]:
def rename_columns(example):
    return {'structured_input': example['input'],'output': example['output']}

ds = ds.map(rename_columns, remove_columns=['input'])
ds

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['output', 'structured_input'],
        num_rows: 5000
    })
})

In [4]:
train_df = ds['train'].to_pandas()
train_df.head(3)

|   | output | structured_input |
|---|---|---|
| 0 | `{"result": "5180"}` | `{"A":74,"op":"*","B":70}` |
| 1 | `{"result": "50"}` | `{"A":78,"op":"-","B":28}` |
| 2 | `{"result": "74"}` | `{"A":68,"op":"+","B":6}` |
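
Both columns hold JSON strings, so a quick consistency check is possible before generating new prompts. The sketch below is a minimal example, assuming that results are stored as the string form of the integer answer (as in the three preview rows). The helper `check_row` is hypothetical, and division is deliberately left unchecked because the preview does not show how `/` results are formatted.

```python
import json
import operator

# Operators whose stored result format is visible in the preview.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def check_row(structured_input, output):
    """Return True/False if the stored result matches A op B, or None if unchecked."""
    record = json.loads(structured_input)
    if record["op"] not in OPS:
        return None
    expected = OPS[record["op"]](record["A"], record["B"])
    # The result is stored as a string, e.g. {"result": "5180"}.
    return json.loads(output)["result"] == str(expected)


# The three rows shown in the preview above.
rows = [
    ('{"A":74,"op":"*","B":70}', '{"result": "5180"}'),
    ('{"A":78,"op":"-","B":28}', '{"result": "50"}'),
    ('{"A":68,"op":"+","B":6}', '{"result": "74"}'),
]
checks = [check_row(s, o) for s, o in rows]
print(checks)
```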


In [5]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import json

model_path='/kaggle/input/phi-3/pytorch/phi-3.5-mini-instruct/2'

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)


In [6]:
model = AutoModelForCausalLM.from_pretrained(model_path)
model

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3SdpaAttention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): Phi3RMSNorm((3072,)

In [7]:
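import torch
from torch import nn

num_gpus = torch.cuda.device_count()

# Always define `device` (later cells use it), falling back to CPU without a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
if num_gpus > 1:
    # Note: nn.DataParallel wraps the model, so attributes such as `config`
    # are only reachable through `model.module` afterwards
    model = nn.DataParallel(model)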

In [8]:
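class MultiGPUPipeline:
    """Thin wrapper around a Hugging Face text-generation pipeline."""

    def __init__(self, model, tokenizer, device):
        # pipeline() reads attributes such as `config` from the model itself,
        # so unwrap nn.DataParallel if the model was wrapped earlier
        self.model = getattr(model, "module", model)
        self.tokenizer = tokenizer
        self.device = device
        # Use the device that was passed in rather than a hard-coded GPU index
        self.pipeline = pipeline('text-generation', model=self.model, tokenizer=self.tokenizer, device=self.device)

    def generate(self, inputs, **kwargs):
        return self.pipeline(inputs, **kwargs)

generator = MultiGPUPipeline(model, tokenizer, device)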

In [9]:
PROMPT_TEMPLATE = [
    {
        "role": "system",
        "content": "You are an AI assistant that converts structured prompts into {conversion_type} questions and no need to answer it"
    },
    {
        "role": "user",
        "content": (
            "Convert the following structured prompt into a {conversion_type} question:\n"
            "Structured Prompt: {structured_input}\n"
        )
    }
]


def fill_prompts(df, conversation_type):
    """
    Fills the PROMPT_TEMPLATE with structured_input from the dataframe.

    Parameters:
    - df (pd.DataFrame): DataFrame containing a 'structured_input' column of JSON strings.
    - conversation_type (str): The target question style, e.g., 'direct' or 'conversational'.

    Returns:
    - List of chat prompts, each a list of {'role', 'content'} message dicts.
    """
    filled_prompts = []
    
    for index, row in df.iterrows():
        try:
            # Parse the structured_input JSON string into a dictionary
            structured_input = json.loads(row['structured_input'])
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON in row {index}: {e}")
            # Skip this row. Note: the returned list is then shorter than df,
            # so results can no longer be assigned back as a column directly.
            continue
        
        # Format the system and user messages with the given conversation_type and structured_input
        system_message = PROMPT_TEMPLATE[0]['content'].format(conversion_type=conversation_type)
        user_message = PROMPT_TEMPLATE[1]['content'].format(
            conversion_type=conversation_type,
            structured_input=json.dumps(structured_input)
        )
        
        # Combine the messages into a prompt
        prompt = [
            {
                "role": "system",
                "content": system_message
            },
            {
                "role": "user",
                "content": user_message
            }
        ]
        
        filled_prompts.append(prompt)
    
    return filled_prompts


direct_prompts=fill_prompts(train_df,'direct')
direct_prompts[0]

[{'role': 'system',
  'content': 'You are an AI assistant that converts structured prompts into direct questions and no need to answer it'},
 {'role': 'user',
  'content': 'Convert the following structured prompt into a direct question:\nStructured Prompt: {"A": 74, "op": "*", "B": 70}\n'}]
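
Because `fill_prompts` skips rows whose JSON cannot be parsed, its result can end up shorter than the DataFrame, while a later cell assigns the generated text back as a column and therefore needs matching lengths. The sketch below shows one way to keep positions aligned by emitting a `None` placeholder instead of skipping. The function name `fill_prompts_aligned` is hypothetical, and it takes a plain list of strings rather than a DataFrame so that the example stays self-contained.

```python
import json

SYSTEM = ("You are an AI assistant that converts structured prompts into "
          "{conversion_type} questions and no need to answer it")
USER = ("Convert the following structured prompt into a {conversion_type} question:\n"
        "Structured Prompt: {structured_input}\n")


def fill_prompts_aligned(structured_inputs, conversion_type):
    """Build one chat prompt per input; malformed JSON yields None instead of being skipped."""
    prompts = []
    for raw in structured_inputs:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            prompts.append(None)  # placeholder keeps positions aligned with the inputs
            continue
        prompts.append([
            {"role": "system", "content": SYSTEM.format(conversion_type=conversion_type)},
            {"role": "user", "content": USER.format(
                conversion_type=conversion_type,
                structured_input=json.dumps(record),
            )},
        ])
    return prompts


inputs = ['{"A":74,"op":"*","B":70}', "not json", '{"A":68,"op":"+","B":6}']
prompts = fill_prompts_aligned(inputs, "direct")
print(len(prompts), prompts[1])
```

With this variant, the `None` entries would have to be filtered out before generation and put back afterwards at the same positions.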

In [10]:
from tqdm import tqdm  # For progress tracking

def generate_prompts_in_batches(prompts, generator, batch_size=8, max_length=100):
    """
    Generates responses for a list of chat prompts using the provided generator in batches.

    Parameters:
    - prompts (List[List[dict]]): Chat prompts, each a list of {'role', 'content'} messages.
    - generator (MultiGPUPipeline): Wrapper around a Hugging Face text-generation pipeline.
    - batch_size (int): Number of prompts passed to the generator per call.
    - max_length (int): Maximum total length in tokens (prompt plus generated text).

    Returns:
    - List[Optional[str]]: The assistant reply for each prompt, or None if there is none.
    """
    generated_responses = []
    
    # Calculate the number of batches
    total_prompts = len(prompts)
    num_batches = (total_prompts + batch_size - 1) // batch_size  # Ceiling division
    
    # Iterate over each batch
    for i in tqdm(range(num_batches), desc="Generating prompts"):
        start_idx = i * batch_size
        end_idx = min(start_idx + batch_size, total_prompts)
        batch_prompts = prompts[start_idx:end_idx]
        

        # Generate responses for the batch
        outputs = generator.generate(
                batch_prompts,
                max_length=max_length,
                num_return_sequences=1,
                do_sample=False  # Set to True if you want sampling
            )
            
        # Each output is a one-element list whose 'generated_text' holds the full chat:
        # the system and user messages plus the newly generated assistant message
        for output in outputs:
            generated_text = output[0]['generated_text']
            # Keep only the assistant reply; append None if the model produced none
            assistant_contents = [message['content'] for message in generated_text if message['role'] == 'assistant']
            generated_responses.append(assistant_contents[0] if assistant_contents else None)

    return generated_responses

# Define batch size based on your hardware capabilities
batch_size = 32  # Example: Adjust as needed (e.g., 8, 16, 32)

# Generate responses
generated_responses = generate_prompts_in_batches(
    direct_prompts,
    generator,
    batch_size=batch_size
)
generated_responses[:1]

Generating prompts:   6%|▋         | 10/157 [05:24<1:19:03, 32.27s/it]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Generating prompts: 100%|██████████| 157/157 [1:23:28<00:00, 31.90s/it]


[' What is the result of multiplying 74 by 70?']
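
The progress bar reports 157 steps because 5000 prompts are split into chunks of 32 using ceiling division. The sketch below reproduces that arithmetic with a small generator; the helper names `chunked` and `num_batches` are illustrative, and a list of integers stands in for the chat prompts.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_batches(total, batch_size):
    """Ceiling division, as used in generate_prompts_in_batches."""
    return (total + batch_size - 1) // batch_size


prompts = list(range(5000))  # stand-in for the 5000 chat prompts
batches = list(chunked(prompts, 32))
# Number of batches and size of the final, partially filled batch
print(len(batches), len(batches[-1]))
```

The last chunk holds only the 8 leftover prompts, which is why the slicing in the notebook clamps `end_idx` with `min`.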

# Conversational Prompts

In [11]:
conversational_prompts=fill_prompts(train_df,'conversational')
conversational_prompts[:1]

[[{'role': 'system',
   'content': 'You are an AI assistant that converts structured prompts into conversational questions and no need to answer it'},
  {'role': 'user',
   'content': 'Convert the following structured prompt into a conversational question:\nStructured Prompt: {"A": 74, "op": "*", "B": 70}\n'}]]

In [12]:
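# Generate responses (this reuses the name `conversational_prompts`, so the
# prompt list is replaced by the generated questions)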
conversational_prompts = generate_prompts_in_batches(
    conversational_prompts,
    generator,
    batch_size=batch_size
)
conversational_prompts[:1]

Generating prompts: 100%|██████████| 157/157 [2:12:16<00:00, 50.55s/it]


[' Could you please calculate the result when you multiply 74 by 70?']

In [13]:
train_df['direct_input'] = generated_responses
train_df['conversational_input'] = conversational_prompts

# Display some generated prompts
print(train_df[['structured_input', 'direct_input', 'conversational_input']].head())

           structured_input  \
0  {"A":74,"op":"*","B":70}   
1  {"A":78,"op":"-","B":28}   
2   {"A":68,"op":"+","B":6}   
3  {"A":50,"op":"-","B":55}   
4  {"A":75,"op":"-","B":79}   

                                        direct_input  \
0        What is the result of multiplying 74 by 70?   
1   What is the result when you subtract 28 from 78?   
2                       What is the sum of 68 and 6?   
3   What is the result when you subtract 55 from 50?   
4   What is the result when you subtract 79 from 75?   

                                conversational_input  
0   Could you please calculate the result when yo...  
1   Could you help me figure out what the result ...  
2    Could you please help me add 68 and 6 together?  
3   Could you help me figure out what the result ...  
4   Could you help me figure out what the result ...  
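
The generated questions in the output above start with a leading space, and `generate_prompts_in_batches` can also return `None` when no assistant message is found. A minimal cleanup sketch, assuming it is acceptable to strip surrounding whitespace and to treat empty strings as missing (the helper `clean_response` is hypothetical):

```python
def clean_response(text):
    """Strip surrounding whitespace; map missing or empty responses to None."""
    if text is None:
        return None
    cleaned = text.strip()
    return cleaned or None


raw_responses = [
    " What is the result of multiplying 74 by 70?",
    " Could you please help me add 68 and 6 together?",
    None,  # appended by the generation loop when no assistant message exists
]
cleaned = [clean_response(r) for r in raw_responses]
print(cleaned)
```

Applying such a step to the response lists before they are assigned as columns keeps the stored text free of stray whitespace.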


In [14]:
# Display a sample of generated prompts
sample_prompts = train_df.sample(10)
print(sample_prompts[['structured_input', 'direct_input', 'conversational_input']])

              structured_input  \
2085  {"A":31,"op":"*","B":65}   
2925  {"A":90,"op":"*","B":43}   
2269  {"A":22,"op":"*","B":24}   
3664   {"A":89,"op":"+","B":8}   
2952  {"A":85,"op":"+","B":32}   
381   {"A":72,"op":"*","B":28}   
3128  {"A":54,"op":"/","B":78}   
310   {"A":95,"op":"/","B":47}   
4031   {"A":51,"op":"-","B":4}   
2761   {"A":3,"op":"/","B":43}   

                                           direct_input  \
2085        What is the result of multiplying 31 by 65?   
2925        What is the result of multiplying 90 by 43?   
2269   What is the result of multiplying the value 2...   
3664   What is the result when you add the value 89 ...   
2952   What is the result when you add the value 85 ...   
381         What is the result of multiplying 72 by 28?   
3128       What is the result when you divide 54 by 78?   
310        What is the result when you divide 95 by 47?   
4031    What is the result when you subtract 4 from 51?   
2761   What is the result when you 
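
Since the questions are produced by a language model, a lightweight check that each one still mentions both operands can catch obvious generation errors before upload. The sketch below is one possible check, assuming non-negative integer operands as in the preview rows. The helper `mentions_operands` is hypothetical, and the check does not verify the operator or the operand order.

```python
import json
import re


def mentions_operands(structured_input, question):
    """Return True if both operands A and B appear as whole numbers in the question."""
    record = json.loads(structured_input)
    numbers = set(re.findall(r"\d+", question))
    return {str(record["A"]), str(record["B"])} <= numbers


pairs = [
    ('{"A":50,"op":"-","B":55}', "What is the result when you subtract 55 from 50?"),
    ('{"A":68,"op":"+","B":6}', "Could you please help me add 68 and 6 together?"),
    # Made-up bad case for illustration: the second operand is wrong.
    ('{"A":74,"op":"*","B":70}', "What is the result of multiplying 74 by 7?"),
]
flags = [mentions_operands(s, q) for s, q in pairs]
print(flags)
```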

# Upload to Hugging Face

In [15]:
from datasets import Dataset

diverse_calculation=Dataset.from_pandas(train_df)

In [16]:
diverse_calculation.push_to_hub('aisuko/diverse_calculation')

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/aisuko/diverse_calculation/commit/d4c6f5cca0f6bf8d1ca5489d16d97b982288ce7b', commit_message='Upload dataset', commit_description='', oid='d4c6f5cca0f6bf8d1ca5489d16d97b982288ce7b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/aisuko/diverse_calculation', endpoint='https://huggingface.co', repo_type='dataset', repo_id='aisuko/diverse_calculation'), pr_revision=None, pr_num=None)