<a href="https://colab.research.google.com/github/Thambara-20/spm-ai-assistant/blob/main/spm_ai_assistant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers torch datasets




In [90]:
import json
from datasets import Dataset

# Load dataset from the local Colab path
with open('/content/scrum_activities.json') as f:
    data = json.load(f)

dataset = Dataset.from_list(data)

# Convert flattened data to a Hugging Face dataset format
print(dataset)


Dataset({
    features: ['activity', 'input_text', 'output_text'],
    num_rows: 16
})


In [91]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

# Define the tokenize function for each input-output pair
def tokenize_function(example):
    prompt = example["input_text"] + tokenizer.eos_token  # Add end-of-sequence token to input
    target = example["output_text"] + tokenizer.eos_token  # Add end-of-sequence token to output

    print("activity:", example["activity"])
    print("prompt:", prompt)
    print("target:", target)

    input_ids = tokenizer(prompt, truncation=True, padding='max_length', max_length=128)["input_ids"]
    target_ids = tokenizer(target, truncation=True, padding='max_length', max_length=128)["input_ids"]

    # Prepare labels for training (mask padding tokens to -100)
    labels = [-100 if token == tokenizer.pad_token_id else token for token in target_ids]

    return {
        "activity": example["activity"],  # Include activity in the output
        "input_ids": input_ids,
        "labels": labels
    }

# Apply tokenizer to the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=False)




Map:   0%|          | 0/16 [00:00<?, ? examples/s]

activity: Sprint Planning
prompt: Help me plan tasks for the sprint focused on login features.<|endoftext|>
target: To plan for login features, break down tasks like implementing the authentication API, designing the login UI, and integrating UI with the backend. Prioritize based on dependencies.<|endoftext|>
activity: Sprint Planning
prompt: How should we prioritize tasks in sprint planning?<|endoftext|>
target: Start by prioritizing tasks that provide the most value to the end user, and consider dependencies. High-impact tasks with fewer dependencies should be prioritized.<|endoftext|>
activity: Daily Stand-up
prompt: What should I share in today's stand-up?<|endoftext|>
target: In your stand-up update, share what you completed yesterday, any blockers, and what you plan to work on today. Keep it brief and focused.<|endoftext|>
activity: Daily Stand-up
prompt: I'm blocked on a task due to missing permissions. What should I do?<|endoftext|>
target: Mention this blocker in the stand-up 

In [None]:
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Data collator setup
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments with optimizations
training_args = TrainingArguments(
    output_dir="/content/gpt2_sprint_model",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    save_steps=1000,
    save_total_limit=2,
    logging_dir='/content/logs',
    logging_steps=200,
    evaluation_strategy="no",
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

# Start training
trainer.train()




Step,Training Loss


Non-default generation parameters: {'max_length': 50, 'do_sample': True}


In [95]:
model.save_pretrained("/content/gpt2_sprint_model")
tokenizer.save_pretrained("/content/gpt2_sprint_model")



('/content/gpt2_sprint_model/tokenizer_config.json',
 '/content/gpt2_sprint_model/special_tokens_map.json',
 '/content/gpt2_sprint_model/vocab.json',
 '/content/gpt2_sprint_model/merges.txt',
 '/content/gpt2_sprint_model/added_tokens.json')

In [98]:
from transformers import pipeline

# Load the trained model
model = GPT2LMHeadModel.from_pretrained("/content/gpt2_sprint_model")
tokenizer = GPT2Tokenizer.from_pretrained("/content/gpt2_sprint_model")

# Set up the pipeline for text generation on GPU if available
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)  # use device=0 for GPU

# Generate output for a test input
input_text = "Help me plan tasks for the sprint focused on login features."
output = generator(input_text, max_length=150, num_return_sequences=1, truncation=True)
print(output[0]['generated_text'])


Help me plan tasks for the sprint focused on login features.

We always need to think in the context of users and we have to have something working well enough for the users that we will need to build functionality for during the sprint. This is an excellent opportunity to build tools to test specific features, such as integration, integration tests, integration docs, integration tests for UI, integration docs for users and users for integration.

As a bonus our users will appreciate the ability to see how we are working on integration in our sprint reports too so they can easily provide feedback.

Integration Testing

In addition to testing to make sure our new features work as intended, we want to provide feedback for integration to our regular user:

