# **OPEN-ARC**
---

### Project 5: Terraria Weapon Name Generation Model:
**Challenge:** Create an AI model capable of generating convincing Terraria weapon names.


### Terms and Use:
Learn more about the project's [LICENSE](https://github.com/Infinitode/OPEN-ARC/blob/main/LICENSE) and read our [CODE_OF_CONDUCT](https://github.com/Infinitode/OPEN-ARC/blob/main/CODE_OF_CONDUCT) before contributing to the project. You can contribute to this project from here: [https://github.com/Infinitode/OPEN-ARC/](https://github.com/Infinitode/OPEN-ARC/).

---

Please fill out this performance sheet to help others quickly see your model's performance **(optional)**:

### Performance Sheet:
| Contributor | Architecture Type | Platform | Base Model | Dataset | Accuracy | Link |
|-------------|-------------------|----------|------------|---------|----------|------|
| Infinitode  | SimpleRNN  | Kaggle   | ✔  | All Terraria Weapons DPS V_1.4.4.9 | 78.6%    | [Notebook](https://github.com/Infinitode/OPEN-ARC/Project-5-TWNG/project-5-twng.ipynb) |
| Username  | Unknown  | Kaggle   | ✗/✔  | All Terraria Weapons DPS V_1.4.4.9 | Score    | [Notebook](https://github.com) |

---

### Model: SimpleRNN:
This model uses the built-in Keras `SimpleRNN` layer to quickly learn representations of our text data. This version also uses `sentencepiece`, which splits words into subword tokens. Subword tokens produce much shorter sequences than single char-to-idx tokens, and that allows for higher accuracy scores.
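To illustrate why subword tokens give shorter sequences than characters, here is a toy sketch. It is not SentencePiece BPE: it uses greedy longest-match over a hand-picked, hypothetical vocabulary (`TOY_VOCAB`), and the name `Night Blade` is made up for the example.

```python
# Toy illustration only: greedy longest-match over a hand-picked vocabulary.
# SentencePiece BPE learns its pieces from data; TOY_VOCAB is hypothetical.
TOY_VOCAB = {"▁", "Star", "Night", "Blade", "Staff"}

def char_tokens(name):
    """One token per character, with spaces shown as the '▁' marker."""
    return list(name.replace(" ", "▁"))

def subword_tokens(name, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown text falls back to single characters."""
    text = name.replace(" ", "▁")
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

name = "Night Blade"  # made-up example name
chars = char_tokens(name)
pieces = subword_tokens(name)
print(len(chars), chars)
print(len(pieces), pieces)
```

The character version needs one prediction step per letter, while the subword version needs only a handful, so the recurrent layer has far fewer steps to remember.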

## Importing and training our `sentencepiece` model

In [1]:
import sentencepiece as spm
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

# Read our dataset; we're only interested in the NAME column
df = pd.read_csv("/kaggle/input/all-terraria-weapons-dps-v-1449/Terraria DPS_TV1.4.4.9_V1 - Sheet1.csv")
names = df["NAME"]
names.to_csv("names.txt", index=False, header=False)

# Train SentencePiece model
spm.SentencePieceTrainer.Train('--input=/kaggle/working/names.txt --model_prefix=terraria_weapon_names --vocab_size=500 --character_coverage=1.0 --model_type=bpe')

2024-08-15 12:49:54.298783: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-15 12:49:54.298884: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-15 12:49:54.424087: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=/kaggle/working/names.txt --model_prefix=terraria_weapon_names --vocab_size=500 --character_coverage=1.0 --model_type=bpe
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /kaggle/working/names.txt
  input_format: 
  model_prefix: terraria_weapon_names
  model_ty

After training, you will notice two new files: a model file (`terraria_weapon_names.model`) and a vocab file (`terraria_weapon_names.vocab`). Include them when distributing the trained network, since our preprocessing steps rely on them being present.

In [3]:
# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.Load("terraria_weapon_names.model")

# Tokenize the dataset
with open('/kaggle/working/names.txt', 'r') as file:
    text = file.read()
names = text.split('\n')

tokenized_names = [sp.encode_as_pieces(name) for name in names]
tokenized_ids = [sp.encode_as_ids(name) for name in names]
vocab_size = len(sp)

# Create input-output pairs
input_sequences = []
for seq in tokenized_ids:
    for i in range(1, len(seq)):
        n_gram_seq = seq[:i+1]
        input_sequences.append(n_gram_seq)

# Pad sequences
max_seq_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')

# Split into input (X) and output (y)
X = input_sequences[:,:-1]
y = input_sequences[:,-1]

# One-hot encode the output
y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
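To make the input-output pairing in the cell above concrete, here is a minimal NumPy-only sketch of the same idea on two tiny sequences. The helper name `make_training_pairs` and the token IDs are hypothetical, and the padding is done by hand rather than with `pad_sequences`.

```python
import numpy as np

def make_training_pairs(token_id_seqs):
    """Build pre-padded prefix inputs X and next-token targets y."""
    prefixes = []
    for seq in token_id_seqs:
        # Every prefix of length >= 2 becomes one training example
        for i in range(1, len(seq)):
            prefixes.append(seq[:i + 1])
    max_len = max(len(p) for p in prefixes)
    # Left-pad with zeros so the newest tokens sit at the end of each row
    padded = np.zeros((len(prefixes), max_len), dtype=np.int64)
    for row, p in enumerate(prefixes):
        padded[row, max_len - len(p):] = p
    # The last column is the target; everything before it is the input
    return padded[:, :-1], padded[:, -1]

# Hypothetical token IDs for two short names
X_demo, y_demo = make_training_pairs([[7, 42, 13], [5, 9]])
print(X_demo)
print(y_demo)
```

Each row of `X_demo` is a left-padded prefix and the matching entry of `y_demo` is the token that follows it, which is what the model learns to predict.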

In [4]:
# Create the SimpleRNN model; for such a small dataset, we don't need a very large model
model = Sequential([
    Embedding(vocab_size, 500),
    SimpleRNN(50),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [5]:
# Early stopping
early_stopping = EarlyStopping(monitor='loss', patience=10, restore_best_weights=True)

# Train the model
history = model.fit(X, y, epochs=500, batch_size=128, verbose=1, callbacks=[early_stopping])

Epoch 1/500
 1/11 ━━━━━━━━━━━━━━━━━━━━ 34s 3s/step - accuracy: 0.0000e+00 - loss: 6.2186

I0000 00:00:1723726301.679903     120 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

11/11 ━━━━━━━━━━━━━━━━━━━━ 5s 121ms/step - accuracy: 0.0043 - loss: 6.1931
Epoch 2/500
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0082 - loss: 6.0114
Epoch 3/500
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0100 - loss: 5.7892
Epoch 4/500
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0181 - loss: 5.6218
Epoch 5/500
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0204 - loss: 5.5398
Epoch 6/500
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0331 - loss: 5.5051
Epoch 7/500
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0306 - loss: 5.4743
Epoch 8/500
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0320 - loss: 5.4722
Epoch 9/500
11/11 ━━━━━━━━━━━━

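Lower loss values are better. Our final loss was `0.4452` (accuracy of `0.7864`). Both figures are measured on the training data, since `model.fit` is called without a validation split.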
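As a rough way to read these loss values, the sketch below relates categorical cross-entropy to probabilities. It assumes a vocabulary of 500 pieces (the `--vocab_size` passed to SentencePiece). A model that guesses uniformly would score about 6.21, close to the loss logged in the first epoch, and a loss of `0.4452` corresponds to a geometric-mean probability of roughly 0.64 on the correct token.

```python
import math

vocab_size_assumed = 500  # assumption: matches --vocab_size in the training command

# Cross-entropy of a uniform guess over the whole vocabulary
uniform_loss = -math.log(1.0 / vocab_size_assumed)

# Geometric-mean probability given to the correct token at the final loss
final_loss = 0.4452
mean_correct_prob = math.exp(-final_loss)

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"probability implied by final loss: {mean_correct_prob:.4f}")
```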

In [6]:
# Check model parameters and size, as well as layer shapes. The middle number in the embedding layer's output shape is our max_sequence_length
model.summary()

This model has a `max_sequence_length` of `11` and a total param count of `909,152`.
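The total can be reproduced by hand. The sketch below is one reading that matches the reported number, under two assumptions: the vocabulary has 500 pieces, and the summary total includes the Adam optimizer's state (two slots per weight plus two scalars) on top of the trainable weights.

```python
vocab_size_assumed, embed_dim, rnn_units = 500, 500, 50

embedding = vocab_size_assumed * embed_dim                 # one vector per token
simple_rnn = rnn_units * (embed_dim + rnn_units + 1)       # input + recurrent weights + bias
dense = rnn_units * vocab_size_assumed + vocab_size_assumed  # weights + bias

trainable = embedding + simple_rnn + dense

# Assumption: Adam stores a momentum and a velocity per weight,
# plus an iteration counter and a learning rate
optimizer_state = 2 * trainable + 2

total = trainable + optimizer_state
print(trainable, optimizer_state, total)
```

Under these assumptions the network itself has 303,050 trainable weights, and the rest of the total is optimizer state that is not needed for inference.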

In [12]:
import random

def generate_random_name(min_length=3, max_length=10, temperature=1.0, seed_text=""):
    # Reseed Python's random number generator from system entropy
    random.seed()
    
    if seed_text:
        # If seed text is provided, mark its spaces the way SentencePiece pieces do
        generated_name = seed_text.replace(" ", "▁")
    else:
        # Randomly select a token from our vocab as our starting token if no seed text is present
        random_index = random.randint(1, vocab_size-1)
        random_token = sp.id_to_piece(random_index)
        generated_name = random_token

    # Decode once up front so decoded_name exists even if the loop never runs
    decoded_name = sp.decode_pieces(generated_name.split())

    # Generate subsequent subword tokens
    for _ in range(max_length - 1):
        # Encode the text generated so far and pad it to the model's input length
        token_list = sp.encode_as_ids(generated_name)
        token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding='pre')
        
        # Run prediction
        predicted = model.predict(token_list, verbose=0)[0]

        # Apply temperature to the predictions; lower values give safer output, higher values more variety
        # (float64 keeps the probabilities summing to 1 closely enough for np.random.choice)
        predicted = np.log(predicted.astype("float64") + 1e-8) / temperature
        predicted = np.exp(predicted) / np.sum(np.exp(predicted))

        # Sample from the distribution
        next_index = np.random.choice(range(vocab_size), p=predicted)
        next_index = int(next_index)
        next_token = sp.id_to_piece(next_index)

        # Add the predicted token to our output
        generated_name += next_token
        
        # Decode the generated subword tokens into a string
        decoded_name = sp.decode_pieces(generated_name.split())

        # Stop if the end token is predicted (optional, based on your dataset), or stop if max_length is exceeded
        if next_token == '</s>' or len(decoded_name) > max_length:
            break

    # Replace any remaining SentencePiece space markers with spaces
    decoded_name = decoded_name.replace("▁", " ")
    
    # Remove stop tokens from the output
    decoded_name = decoded_name.replace("</s>", "")
    
    # Drop the last word, since it may be cut off, then capitalize the first letter of the name
    generated_name = decoded_name.rsplit(' ', 1)[0].strip()
    if generated_name:
        generated_name = generated_name[0].upper() + generated_name[1:]

    # Split the name and check the last part; drop it if it is shorter than min_length
    parts = generated_name.split()
    if parts and len(parts[-1]) < min_length:
        generated_name = " ".join(parts[:-1])
    
    # Strip the output to ensure no extra whitespace
    return generated_name.strip()
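The temperature step in the function above can be tried in isolation. The sketch below applies the same log, divide, and renormalize steps to a small made-up distribution. The helper name `apply_temperature` and the probabilities are hypothetical, and subtracting the maximum is an added numerical-stability step that does not change the result.

```python
import numpy as np

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-8) / temperature
    logits -= logits.max()  # stability only; cancels out after normalization
    scaled = np.exp(logits)
    return scaled / scaled.sum()

demo_probs = [0.6, 0.3, 0.1]  # hypothetical next-token distribution
for t in (0.5, 1.0, 2.0):
    print(t, np.round(apply_temperature(demo_probs, t), 3))
```

With a temperature below 1 the most likely token gains probability mass, so names stay closer to the training data. Above 1 the distribution flattens and the output becomes more varied.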

Let's finally test our trained model. The run below produced the following names:
- Ishark Lance Scepter
- Anchor Paintball Gungnirowerha
- Ireunamizzard Staff
- Xlectrosphere Launcherion
- Boomstickuswood Bowy Staff

Configuration used for this output:
- `min_length` = 3 (minimum length for the final part; a shorter final part is dropped)
- `max_length` = 30 (maximum total length)
- `temperature` = 0.5 (for balanced results)
- `seed_text` = '' (empty, so generation starts from a random token; set it for a custom start)


In [17]:
# Example usage; adjust the amount and other parameters based on your preferences. Higher max_length values are recommended
for _ in range(5):
    print(generate_random_name(min_length=3, max_length=30, temperature=0.5))

Ishark Lance Scepter
Anchor Paintball Gungnirowerha
Ireunamizzard Staff
Xlectrosphere Launcherion
Boomstickuswood Bowy Staff


In [18]:
# Save the full model (architecture and weights) in HDF5 (.h5) format
model.save("terraria_weapon_name_generation_model.h5")

### The End:
This is the end of this project notebook. Make sure to experiment and contribute to help improve the model and implementation. You can browse more of our free, open-source projects on our GitHub repository: [https://github.com/Infinitode/OPEN-ARC](https://github.com/Infinitode/OPEN-ARC). If you like this project, make sure to star the repo and contribute your implementation, or help others in the community.

~ Infinitode