# LLM for multiple people

## Data Preparation:
Your dataset is a sequence of turns in which people respond to earlier statements. First, organize it into pairs, each holding an input statement and the response to that statement.

In [2]:
# Example Data Structure
data = [
    {"input": "How are you?", "response": "I'm good, thanks."},
    {"input": "What's your favorite color?", "response": "Blue."},
    {"input": "What do you do for a living?", "response": "I am a politician."},
    {"input": "What is your plan for the weekend?", "response": "I plan to go to the mountains with my family this weekend."},
    {"input": "What do you like best about your job?", "response": "Conversations with people."},
    {"input": "How often do you play sports?", "response": "Only once a week."},
    {"input": "What is your favourite sport?", "response": "My favourite sport is football."},
    {"input": "What's your favorite food?", "response": "My favorite is fried chicken from KFC."},
    # ...
]
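
If the raw conversation is a flat list of turns rather than ready-made pairs, the pairing described above can be sketched as follows. The `turns` list, the speaker names, and the `make_pairs` helper are hypothetical, introduced only for illustration. The sketch assumes that the turn immediately before the target speaker's turn is the statement being answered.

```python
# Hypothetical raw conversation: (speaker, text) tuples in chronological order.
turns = [
    ("Alice", "How are you?"),
    ("Bob", "I'm good, thanks."),
    ("Alice", "What's your favorite color?"),
    ("Bob", "Blue."),
    ("Bob", "And yours?"),
    ("Alice", "Green."),
]


def make_pairs(turns, target_speaker):
    """Pair each turn by target_speaker with the preceding turn from someone else."""
    pairs = []
    for (prev_speaker, prev_text), (speaker, text) in zip(turns, turns[1:]):
        # Skip turns by other speakers and turns where the target follows themselves.
        if speaker == target_speaker and prev_speaker != target_speaker:
            pairs.append({"input": prev_text, "response": text})
    return pairs


bob_pairs = make_pairs(turns, "Bob")
alice_pairs = make_pairs(turns, "Alice")
print(bob_pairs)
print(alice_pairs)
```

Building one list per speaker this way gives each person their own training set, which fits the goal of one model per person.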

## Data Split:
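Split the data into training, validation, and test sets. Holding data out does not prevent overfitting by itself, but it lets you detect it: a model that scores well on the training set and poorly on the validation set is not generalizing. The split below is 80% training, 10% validation, and 10% test.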

In [22]:
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling, TrainingArguments, Trainer
import accelerate
import torch
import transformers
import pandas as pd
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

train, temp = train_test_split(data, test_size=0.2, random_state=42)
valid, test = train_test_split(temp, test_size=0.5, random_state=42)
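
With a dataset this small, it is worth checking how many examples land in each split. The sketch below reproduces an 80/10/10 split with the standard library only. It is a stand-in for illustration, not what `train_test_split` does internally, and `split_80_10_10` is a hypothetical helper.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle a copy of items and cut it into roughly 80/10/10 parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_valid = (len(shuffled) - n_train) // 2
    train_part = shuffled[:n_train]
    valid_part = shuffled[n_train:n_train + n_valid]
    test_part = shuffled[n_train + n_valid:]
    return train_part, valid_part, test_part


examples = [f"example {i}" for i in range(8)]  # same size as the sample data
train_part, valid_part, test_part = split_80_10_10(examples)
print(len(train_part), len(valid_part), len(test_part))
```

With eight examples, the validation and test parts each hold a single pair, which is far too few for a meaningful score.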

In [4]:
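import os

os.makedirs('./data', exist_ok=True)  # open() fails if the directory does not exist

# Write each pair as two lines: the input, then the response
with open('./data/train.txt', 'w', encoding="utf-8") as file:
    for entry in train:
        file.write(entry['input'] + '\n')
        file.write(entry['response'] + '\n')

# Write the validation split, not the training split
with open('./data/valid.txt', 'w', encoding="utf-8") as file:
    for entry in valid:
        file.write(entry['input'] + '\n')
        file.write(entry['response'] + '\n')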
        
# with open('./data/train.txt', 'w') as file:
#     for entry in train:
#         file.write(f"{entry['input']} -> {entry['response']}\n")

# with open('./data/valid.txt', 'w') as file:
#     for entry in valid:
#         file.write(f"{entry['input']} -> {entry['response']}\n")
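
The same two-line layout can be produced for any split, including the test split, with one helper. This is a sketch: `write_pairs` and `read_pairs` are hypothetical names, and it assumes no statement contains a newline, since each statement occupies exactly one line.

```python
import tempfile
from pathlib import Path


def write_pairs(path, entries):
    """Write each pair as two lines: the input, then the response."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    with path.open("w", encoding="utf-8") as file:
        for entry in entries:
            file.write(entry["input"] + "\n")
            file.write(entry["response"] + "\n")


def read_pairs(path):
    """Read the two-line layout back into a list of pair dictionaries."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [{"input": a, "response": b} for a, b in zip(lines[0::2], lines[1::2])]


sample = [
    {"input": "How are you?", "response": "I'm good, thanks."},
    {"input": "What's your favorite color?", "response": "Blue."},
]

# Round-trip through a temporary folder to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "test.txt"
    write_pairs(target, sample)
    round_trip = read_pairs(target)

print(round_trip == sample)
```

Reading the file back is a cheap way to confirm that each split file contains the entries you intended.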

In [15]:
print(accelerate.__version__)

0.23.0


In [16]:
print(torch.cuda.is_available())

False


In [17]:
transformers.training_args.is_accelerate_available()

True

## Training:
You'll need a framework that supports fine-tuning language models. Many of OpenAI's models are not openly available, but the GPT-2 weights are, and they can be fine-tuned with Hugging Face's Transformers library.

In [27]:
MODEL_NAME = "gpt2-medium"  # Or another suitable variant
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the data. TextDataset cuts the tokenized file into blocks of block_size
# tokens, so a file with fewer than block_size tokens yields an empty dataset.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="./data/train.txt", block_size=128)
valid_dataset = TextDataset(tokenizer=tokenizer, file_path="./data/valid.txt", block_size=128)

# mlm=False selects causal (next-token) language modeling instead of masked LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define training arguments and initialize Trainer
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,  # Increase this for real training
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_dir="./logs",
    no_cuda=True,  # Train on CPU, since torch.cuda.is_available() is False here
)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)

# Trainer expects Dataset objects and builds its own DataLoaders internally.
# Passing a DataLoader as train_dataset or eval_dataset raises:
#     TypeError: 'DataLoader' object is not subscriptable
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)

trainer.train()

In [19]:
tokenized_text = tokenizer.encode(train[0]["input"])
print(tokenized_text)

[2437, 389, 345, 30]


In [11]:
train_dataset = TextDataset(tokenizer=tokenizer, file_path="./data/train.txt", block_size=128)
print(train_dataset.examples)

[]
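
The empty list is expected with so little text. `TextDataset` is understood to tokenize the whole file into one long sequence and keep only complete blocks of `block_size` tokens, dropping any remainder. The sketch below mimics that chunking in plain Python. `chunk_into_blocks` is a hypothetical helper, not the library's code, and the token counts are made up.

```python
def chunk_into_blocks(token_ids, block_size):
    """Keep only complete, non-overlapping blocks of block_size tokens."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


short_file = list(range(60))    # pretend the file tokenizes to 60 tokens
longer_file = list(range(300))  # pretend a larger file tokenizes to 300 tokens

print(len(chunk_into_blocks(short_file, 128)))   # 0 blocks: fewer than 128 tokens
print(len(chunk_into_blocks(short_file, 16)))    # 3 blocks, 12 tokens dropped
print(len(chunk_into_blocks(longer_file, 128)))  # 2 blocks, 44 tokens dropped
```

So either collect enough text to fill at least one block, or lower `block_size` until the training file produces examples.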


In [25]:
train_dataloader = DataLoader(
            train,  # The training samples.
            sampler = RandomSampler(train), # Select batches randomly
            batch_size = 32 # Trains with this batch size.
        )


In [26]:
print(train_dataloader)

<torch.utils.data.dataloader.DataLoader object at 0x000002042D8A6FB0>


## Model Evaluation:
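After training, evaluate each trained model on held-out data to gauge its performance. Note that `trainer.evaluate()` with no arguments scores the `eval_dataset` passed to `Trainer`, which is the validation split here. To score the test split, write it to a file, build a dataset from it, and pass that dataset to `evaluate()`.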

In [None]:
import math  # needed for math.exp below

results = trainer.evaluate()

print(f"Perplexity: {math.exp(results['eval_loss'])}")
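
Perplexity is the exponential of the average per-token cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. A small worked example, using made-up token probabilities rather than real model output:

```python
import math

# Hypothetical probabilities the model assigned to each correct next token.
token_probs = [0.5, 0.25, 0.125, 0.5]

# Cross-entropy loss is the mean negative log-probability per token.
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(loss)

print(f"loss = {loss:.4f}")
print(f"perplexity = {perplexity:.4f}")
```

Here the perplexity works out to 2 raised to the power 1.75, about 3.36, meaning the model was on average about as uncertain as if it were choosing among three or four equally likely tokens. Lower is better.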

## Fine-tuning & Iteration:
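The first model might not perfectly mimic your target person. You might need to:

- Collect more data.
- Modify the architecture or training parameters, for example the model size, the number of epochs, or `block_size`.
- Try additional techniques. Note that fine-tuning a pretrained GPT-2 is already transfer learning, and attention is already part of its architecture, so those two are in use from the start.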

## Incorporating the Interaction:
The model you've trained has learned the statement-then-response layout of the training files. To make it respond to a statement, provide the input statement followed by a newline as the prompt, let the model generate a continuation, and keep the first generated line as the response.
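
A minimal sketch of that interaction, assuming `model` and `tokenizer` are the fine-tuned GPT-2 objects from the training step and that the training files put each response on the line after its input. The helper names `build_prompt`, `extract_response`, and `generate_response`, and the sampling settings, are illustrative choices rather than anything this notebook specifies.

```python
def build_prompt(statement):
    """Mirror the training layout: the input line, then a newline for the reply."""
    return statement.strip() + "\n"


def extract_response(generated_text, prompt):
    """Drop the echoed prompt and keep only the first generated line."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    stripped = continuation.strip()
    return stripped.splitlines()[0] if stripped else ""


def generate_response(model, tokenizer, statement, max_new_tokens=40):
    """Generate a reply with a Hugging Face causal LM and its tokenizer."""
    prompt = build_prompt(statement)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample instead of greedy decoding for more varied replies
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_response(text, prompt)


# The string handling can be checked without loading a model:
fake_output = "How are you?\nI'm good, thanks.\nWhat's your favorite color?"
print(extract_response(fake_output, build_prompt("How are you?")))
```

Once a model is trained, a call such as `generate_response(model, tokenizer, "How are you?")` would return one line of generated text. Keeping only the first line matters because the model tends to continue with further statement and response lines.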