## Ideas

#### Starting Simple
Start w/ simple dataset of permutations of X is type A(and B) for ALL pokemon. Create a train/val/test set of questions (and permutations) like: 'what type is Pokemon P?" and include only train pokemon in the dataset. Be sure to test the accuracy of the model first (maybe also have a Phi-3 LLM compare the generations to the ground truth? Or maybe just make outlines generate the type?

Couple of params you can change here already:
- number of permutations of X is type A.
- number of permutations of what type is Pokemon P (basically how many ways do we word the question)
- lora rank and alpha (maybe use the continuous fine-tuning NB/params)
- try to achieve grokking w/ SUPER OVERFITTING and a massive ratio of questions to facts (maybe try adding grokfast to this? https://github.com/ironjr/grokfast)

#### Next Steps

Try more facts (such as pokemon stats, ability, etc)
Try grouping stuff (pokemons A,B,C, and D have type X, but Pokemon A is also type B, etc), also group by stats (every pokemon with 100 speed is a,b,c, ...)

Reverse training nurses the reversal curse (what pokemon have 100 speed? What pokemon are water and dragon type?)

OOOH! Reveral curse: bulbasaur evolves into ivysaur, ivysaur evolves into venusaur. What pokemon does bulbasaur evolve into? (one way) What pokemon evolves into ivysaur? (reverse)

Also, what happens if I add, like, interference (i.e. slight typos, random capitalization for pokemon, random spaces)?

Another interesting question (reversal curse?) What Pokemon is the same type as Marshtom?

In [1]:
import os
import json

import numpy as np
from datasets import Dataset

from typing import List, Tuple

DATASET_DIR = 'pokemon_data'
POKEDEX_FILE = os.path.join(DATASET_DIR, 'pokedex.json')

TYPE_SENTENCES_PER_POKEMON = 10 # this essentially increases the # of data points 10-fold
QUESTIONS_PER_POKEMON = 5 # we'll essentially see every fact 2x for every unique question we're asked about it

In [2]:
with open(POKEDEX_FILE, 'r', encoding='utf-8') as f:
    pokedex = json.load(f)

In [3]:
TYPE_TEMPLATES = [
    '{pokemon} is a {type} Pokemon',
    'In the Pokemon games, {pokemon} has the type {type}',
    "pokemon name: {pokemon}, type: {type}",
    "Did you know? {pokemon} is a Pokemon of {type} type!",
    "The Pokemon known as {pokemon} belongs to the {type} type.",
    "In the world of Pokemon, {pokemon} is a {type} type creature",
    "{pokemon} is a representative of the {type} type",
    "The type of the Pokemon {pokemon} is {type}.",
    "In it's pokedex entry, {pokemon} is listed as a {type}",
    "{pokemon} is a pokemon whose type is {type}",
    "{pokemon}'s type is {type}",
    "Explore the world with {pokemon}, your {type} companion!",
    "{pokemon} is known as a {type} type Pokemon",
    "The {pokemon} is an example of a {type} type Pokemon.",
    "Pokedex data states {pokemon} as a {type}-type Pokemon!",
    "{pokemon} has the typing {type}"
]

# these are a few ways to take a dual type pokemon and replace {type} with two types
DUAL_TYPE_FORMATS = [
    '{type_1}/{type_2}',
    '{type_1} and {type_2}',
    '{type_1}, {type_2}'
]

# to add diversity, we'll also explicitly spell out some dual-type pokemon
DUAL_TYPE_TEMPLATES = [
     "The Pokemon {pokemon} is of {type_1} and {type_2} types.",
    "{pokemon}'s first type is {type_1} and their second type is {type_2}.",
    "{pokemon} is a dual {type_1}/{type_2} Pokemon",
    "The Pokedex lists {pokemon} with the distinct types of {type_1} and {type_2}.",
    "{pokemon} has a type combination of {type_1} and {type_2}",
    "As a pokemon with two types, {pokemon}'s 1st type is {type_1} and it's 2nd type is {type_2}",
    "{pokemon}'s primary typing is {type_1} and it's secondary typing is {type_2}",
    "A dual-type pokemon, {pokemon} possesses both the {type_1} type and the {type_2} type!"
]

QUESTION_TEMPLATES = [
    "What is {pokemon}'s type?",
    "What type of pokemon is {pokemon}?",
    "{pokemon}'s pokedex entry shows it as what type?",
    "In the Pokemon games, what is {pokemon}'s type?",
    "What is the type of {pokemon}?",
    "Can you tell me the type of {pokemon} in the Pokemon universe?",
    "Please explain, what type is {pokemon}?",
    "What type or types does {pokemon} have?",
]

ANSWER_TEMPLATES = [
    "The pokedex says it's a {type} pokemon",
    "It is a {type} Pokemon",
    "{pokemon} has the type {type}",
    "{pokemon}'s type is {type}",
    "{pokemon} is a {type} pokemon"
    "{type}-type."
]

DUAL_TYPE_ANSWER_TEMPLATES = [
    "It has two types, {type_1} and {type_2}",
    "{pokemon}'s primary type is {type_1} and it's second type is {type_2}",
    "it's a dual {type_1}/{type_2} pokemon.",
    "It is a {type_1} and {type_2} type Pokemon.",
    "{pokemon}'s types are {type_1} and {type_2}."
]


### Data Processing Functions

These functions apply the various templates to turn the raw pokemon entries into a type dataset that we can train with.

In [4]:
def dataset_gen():
    """This generator is useful to create a datatsets.Dataset object from a generator of pokemon names and types"""
    for k,v in pokedex.items():
        yield {'pokemon': v['name'], 'types': v['types']}

def pokemon_to_sentence(template: str, pokemon: str, types: List[str]) -> str:
    types = types.copy()
    np.random.shuffle(types) # shuffling the primary and secondary type will hopefully increase diversity/model learning

    for i in range(len(types)): # all types are uppercase, but we may as well randomly lowercase them too
        if np.random.rand() > .5:
            types[i] = types[i].lower()

    if np.random.rand() > .5: # same as w/ the types
        pokemon = pokemon.lower()

    if '{type_1}' in template:
        sentence = template.format(pokemon=pokemon, type_1=types[0], type_2=types[1])
    elif '{type}' in template and len(types) == 2:
        type_format = np.random.choice(DUAL_TYPE_FORMATS)
        dual_types = type_format.format(type_1=types[0], type_2=types[1])

        sentence = template.format(pokemon=pokemon, type=dual_types)
    else:
        sentence = template.format(pokemon=pokemon, type=types[0])

    return sentence

def pokemon_to_sentences(pokemon: str, types: List[str]) -> List[str]:
    templates = TYPE_TEMPLATES.copy()

    if len(types) == 2:
        templates += DUAL_TYPE_TEMPLATES.copy()

    # it's better to use unique templates if possible, but if we want to generate 50 examples/pokemon, we'll have to repeat some stuff
    replace = TYPE_SENTENCES_PER_POKEMON > len(templates)
    chosen_templates = np.random.choice(templates, TYPE_SENTENCES_PER_POKEMON, replace=replace)

    type_sentences = []
    for template in chosen_templates:
        type_sentences.append(pokemon_to_sentence(template, pokemon, types))

    return type_sentences

def generate_type_sentences(examples):
    """
    Takes a pokemon name and type and randomly turns it into a string based on the above templates.
    Use with datasets.map with batched=True b/c we increase the # of elements in the dataset
    """

    all_type_sentences = []

    for pokemon, types in zip(examples['pokemon'], examples['types']):
        type_sentences = pokemon_to_sentences(pokemon, types)

        all_type_sentences.extend(type_sentences)

    return {'text': all_type_sentences}

These functions are like the type dataset ones, but they are specifically geared toward question/answering

In [5]:
def pokemon_to_qa_pair(question_template, pokemon: str, types: List[str]) -> Tuple[str, str]:
    types = types.copy()
    answer_templates = ANSWER_TEMPLATES.copy()

    for i in range(len(types)): # all types are uppercase, but we may as well randomly lowercase them too
        if np.random.rand() > .5:
            types[i] = types[i].lower()

    if np.random.rand() > .5: # same as w/ the types
        pokemon = pokemon.lower()

    if len(types) == 2:
        answer_templates += DUAL_TYPE_ANSWER_TEMPLATES.copy()

        question_dual_type_format = np.random.choice(DUAL_TYPE_FORMATS)
        answer_dual_type_format = np.random.choice(DUAL_TYPE_FORMATS)


        # python fact I didn't know: if the template doesn't expect a key (type_1 or type depending on the template), it just won't be included in the args!
        question_format_data = {
            'pokemon': pokemon,
            'type': question_dual_type_format.format(type_1=types[0], type_2=types[1]),
            'type_1': types[0],
            'type_2': types[1]
        }

        question = question_template.format(**question_format_data)

        answer_template = np.random.choice(answer_templates)

        np.random.shuffle(types)
        answer_format_data = {
            'pokemon': pokemon,
            'type': answer_dual_type_format.format(type_1=types[0], type_2=types[1]),
            'type_1': types[0],
            'type_2': types[1]
        }

        answer = answer_template.format(**answer_format_data)
    else:

        format_data = {'pokemon': pokemon, 'type': types[0]}

        question = question_template.format(**format_data)

        answer_template = np.random.choice(answer_templates)

        answer = answer_template.format(**format_data)

    return question, answer


def pokemon_to_question_answers(pokemon: str, types: List[str]) -> Tuple[List[str], List[str]]:
    replace = QUESTIONS_PER_POKEMON > len(QUESTION_TEMPLATES)

    question_templates = np.random.choice(QUESTION_TEMPLATES, QUESTIONS_PER_POKEMON, replace=replace)

    questions = []
    answers = []

    for template in question_templates:
        question, answer = pokemon_to_qa_pair(template, pokemon, types)

        questions.append(question)
        answers.append(answer)

    return questions, answers


def generate_qa_pairs(examples):
    """
    Basically the same as generate_type_sentneces, but each element in the dataset is a list of questions and answers
    since the model may do poorly depending on the randomly chosen prompt format (even though the data is the same).
    """

    all_questions = []
    all_answers = []

    for pokemon, types in zip(examples['pokemon'], examples['types']):
        questions, answers = pokemon_to_question_answers(pokemon, types)

        all_questions.append(questions)
        all_answers.append(answers)

    return {'questions': all_questions, 'answers': all_answers}


### Type-Sentence Dataset

In [6]:
type_dataset = Dataset.from_generator(dataset_gen)

Using custom data configuration default-6753c726323efdf1
Found cached dataset generator (C:/Users/jorda/.cache/huggingface/datasets/generator/default-6753c726323efdf1/0.0.0)


In [7]:
type_dataset[0]

{'pokemon': 'Bulbasaur', 'types': ['Grass', 'Poison']}

In [8]:
pokemon_to_sentences('Bulbasaur', ['Grass', 'Poison'])

["As a pokemon with two types, bulbasaur's 1st type is poison and it's 2nd type is Grass",
 'In the world of Pokemon, bulbasaur is a poison, Grass type creature',
 'bulbasaur is a representative of the grass/poison type',
 'The Pokemon known as Bulbasaur belongs to the grass and poison type.',
 "Bulbasaur's primary typing is grass and it's secondary typing is poison",
 'Bulbasaur is a grass, poison Pokemon',
 'Bulbasaur is known as a poison, grass type Pokemon',
 'Pokedex data states Bulbasaur as a grass, poison-type Pokemon!',
 'The Pokemon bulbasaur is of poison and grass types.',
 'bulbasaur is a dual poison/grass Pokemon']

In [10]:
type_sentence_dataset = type_dataset.map(generate_type_sentences, batched=True, batch_size=100, remove_columns=['pokemon', 'types'])

Loading cached processed dataset at C:\Users\jorda\.cache\huggingface\datasets\generator\default-6753c726323efdf1\0.0.0\cache-833e28c8d1819686.arrow


In [11]:
type_sentence_dataset[1000]

{'text': 'In the world of Pokemon, Arcanine-Hisui is a Fire and rock type creature'}

### Question/Answer Dataset

In [12]:
pokemon_to_question_answers('Bulbasaur', ['Grass', 'Poison'])

(['Can you tell me the type of bulbasaur in the Pokemon universe?',
  "What is bulbasaur's type?",
  'Please explain, what type is Bulbasaur?',
  "bulbasaur's pokedex entry shows it as what type?",
  "In the Pokemon games, what is Bulbasaur's type?"],
 ['bulbasaur is a poison and grass pokemonpoison and grass-type.',
  "it's a dual grass/poison pokemon.",
  'It is a poison and grass Pokemon',
  "bulbasaur's type is poison and grass",
  'It is a poison, grass Pokemon'])

In [13]:
qa_pair_dataset = type_dataset.map(generate_qa_pairs, batched=True, batch_size=100,
                                         remove_columns=['pokemon'])

  0%|          | 0/15 [00:00<?, ?ba/s]

In [20]:
qa_pair_dataset[6]

{'types': ['fire'],
 'questions': ['What is the type of charmeleon?',
  "charmeleon's pokedex entry shows it as what type?",
  "What is Charmeleon's type?",
  'Please explain, what type is charmeleon?',
  'Can you tell me the type of Charmeleon in the Pokemon universe?'],
 'answers': ["charmeleon's type is fire",
  'charmeleon is a fire pokemonfire-type.',
  'It is a fire Pokemon',
  'charmeleon has the type fire',
  'It is a fire Pokemon']}

## Model

Lets see how well it does in general

In [14]:
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [15]:
tokenizer = AutoTokenizer.from_pretrained('unsloth/Phi-3-mini-4k-instruct') 

model, _ = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-mini-4k-instruct", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 256,
    dtype = None,
    load_in_4bit=False
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


==((====))==  Unsloth: Fast Mistral patching release 2024.6
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.988 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.24. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [16]:
FastLanguageModel.for_inference(model)

In [62]:
messages = [{'role': 'user', 'content': qa_pair_dataset[1528]['question'][0]}]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

In [63]:
outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

["<s><|user|> sneasel-hisui's pokedex entry shows it as what type?<|end|><|assistant|> Sneasel (Nosepass) in Hisui's Pokédex is classified as an Ice-type Pokémon. This classification is consistent across various versions of the Pokémon games, including Hisui's Pokédex, which is a feature in the Pokémon Mystery Dun"]