# Introduction
***

**R2-D2** is one of the most iconic and beloved characters in the Star Wars saga. He is a small astromech droid who serves as a mechanic, hacker, co-pilot, and loyal friend to many heroes of the galaxy. He has been involved in many important events and missions, such as the destruction of the Death Star and the rescue of Princess Leia. He is also known for his witty and expressive beeps and whistles, which only a few can understand.

<img src="https://media4.giphy.com/media/DgphnkWIDqCEo/giphy.gif" style="display: block;margin-left: auto;margin-right: auto;width: 50%;"></img>

However, despite his many achievements and adventures, **R2-D2 often feels lonely or misunderstood** by other droids. His companion C-3PO, a protocol droid who can speak over six million languages, is very different from him in terms of function and temperament. C-3PO is often nervous, pessimistic, and talkative, while R2 is brave, optimistic, and silent. They frequently argue and bicker over trivial matters, and sometimes even endanger their missions.

<img src="https://media.tenor.com/Pd5MLUaeXzwAAAAC/technical-lol.gif" style="display: block;margin-left: auto;margin-right: auto;width: 50%;"></img>

**That's why we will help R2 and create a new friend for him with NLP! 🚀**

# Import libraries
***

In [None]:
import numpy as np
import pandas as pd

import tensorflow as tf

from tqdm import tqdm

# Load data
***

In [None]:
canon = pd.read_csv('/kaggle/input/star-wars-characters/official_canon_characters.csv')
legends = pd.read_csv('/kaggle/input/star-wars-characters/legends_charaters.csv')

# Preprocess
***

## Define vocabulary
For character-level generation, our vocabulary is the set of unique characters that appear across all canon and legends names.

In [None]:
all_names_df = pd.concat([canon, legends])
all_names_df.shape

In [None]:
all_names = all_names_df['name'].str.split('/').explode() # Split names like "Din Grogu / The Child" into two names

In [None]:
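# Index 0 is reserved for padding (an empty string), so no real character shares the value used by pad_sequences
characters = [''] + sorted(set(''.join(all_names))) # sorted() keeps the character order reproducible between runs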
vocab_size = len(characters)

In [None]:
characters

In [None]:
vocab_size
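
To make the vocabulary step concrete, here is a minimal sketch on three toy names (stand-ins for the `name` column, not the real dataset). Note that it also strips the spaces left around each part after splitting on `/`, which the cell above does not do; treat that as an optional assumption.

```python
# Toy stand-in for the `name` column; these three names are illustrative only
toy_names = ["Din Grogu / The Child", "Chewbacca", "R2-D2"]

# Split combined names on "/" and strip the spaces left around each part
split_names = [part.strip() for name in toy_names for part in name.split("/")]

# The vocabulary is every distinct character; sorting makes the order reproducible
toy_characters = sorted(set("".join(split_names)))
toy_vocab_size = len(toy_characters)

print(split_names)
print(toy_characters)
print(toy_vocab_size)
```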

## Tokenize
Tokenizing Chewie:
`Chewbacca` -> `[57, 14, 45, 21, 52, 56, 70, 70, 56]`

In [None]:
MAX_SEQENCE_LENGTH = all_names.apply(lambda name: len(name)).max()

sequences = []
for sentence in all_names.values:
    sequence = []
    for character in sentence:
        sequence.append(characters.index(character))
    sequences.append(sequence)

In [None]:
sequences[28] # 28 is Chewie's index
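
The loop above scans the whole vocabulary list for every character. A possible alternative, sketched here on a toy vocabulary built only from the letters of "Chewbacca", is a dictionary lookup. The names `char_to_index`, `encode` and `decode` are hypothetical helpers, not part of the notebook.

```python
toy_characters = sorted(set("Chewbacca"))  # toy vocabulary, not the real one

# One dictionary lookup per character instead of a list scan with .index()
char_to_index = {ch: i for i, ch in enumerate(toy_characters)}

def encode(name):
    """Turn a name into a list of vocabulary indices."""
    return [char_to_index[ch] for ch in name]

def decode(indices):
    """Turn a list of vocabulary indices back into a name."""
    return "".join(toy_characters[i] for i in indices)

tokens = encode("Chewbacca")
print(tokens)
print(decode(tokens))
```

The indices differ from the ones shown for Chewie above because the toy vocabulary is much smaller.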

## N-gram sequences
Sequence: `[3,28,24,46,47,34,31]`

N-grams: `[3,28]`, `[3,28,24]`, ..., `[3,28,24,46,47,34,31]`

Every name yields several prefixes, so n-grams give us much more training data.

In [None]:
n_gram_sequences = []
for sequence in sequences:
    for i in range(len(sequence)-1):
        n_gram_sequences.append(sequence[:i+2])
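
The same expansion can be sketched as a small standalone function. `prefix_sequences` is a hypothetical name used only for this illustration; it returns the prefixes the loop above appends, so a sequence of `n` tokens yields `n - 1` training sequences.

```python
def prefix_sequences(sequence):
    """Return every prefix of `sequence` that has at least two tokens."""
    return [sequence[:end] for end in range(2, len(sequence) + 1)]

toy_sequence = [3, 28, 24, 46]
for prefix in prefix_sequences(toy_sequence):
    print(prefix)
```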

## Padding
The model's inputs must all be the same length, so we will left-pad every n-gram with zeros to a fixed length.

N-grams: `[3, 28], [3, 28, 24], ..., [3, 28, 24, 46, 47, 34, 31]`

Padded n-grams: `[0, ... , 0, 3, 28], [0, ... , 0, 3, 28, 24], ..., [0, ... , 0, 3, 28, 24, 46, 47, 34, 31]`

The length of the padded n-grams is set to the length of the longest sequence in the dataset so that no data is lost. If it were set lower, longer sequences would have to be truncated.

In [None]:
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(n_gram_sequences, maxlen=MAX_SEQENCE_LENGTH, padding='pre')

In [None]:
padded_sequences.shape

In [None]:
padded_sequences[:3]
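
For intuition, the pre-padding can be written in a few lines of plain Python. This is only a sketch of what `pad_sequences` does with `padding='pre'` and its default front truncation, not a replacement for it. `pad_pre` is a hypothetical helper name.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` tokens if it is longer."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front when too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[3, 28], [3, 28, 24]], maxlen=5))
print(pad_pre([[1, 2, 3, 4]], maxlen=3))  # a too-short maxlen drops the first token
```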

# Model
***
Now we will split the training set into examples and targets.
Padded n-gram: `[0, ... , 0, 5, 37, 3]`

Example: `[0, ... , 0, 5, 37]`

Target: `3`

Because our last layer is a softmax, we must convert the target into a one-hot vector of size `(vocab_size, )`

Converted target: `[0, 0, 0, 1, ... , 0]`

In [None]:
X_train = padded_sequences[:, :-1]
y_train = tf.keras.utils.to_categorical(padded_sequences[:, -1], num_classes=vocab_size)
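
The split and the one-hot conversion can be illustrated on a single padded n-gram in plain Python. The helpers `one_hot` and `split_example_target` and the vocabulary size of 40 are assumptions made for this sketch only.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

def split_example_target(padded, vocab_size):
    """Everything but the last token is the example; the last token is the target."""
    return padded[:-1], one_hot(padded[-1], vocab_size)

toy_padded = [0, 0, 0, 5, 37, 3]
example, target = split_example_target(toy_padded, vocab_size=40)
print(example)
print(target)
```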

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 300, input_length=MAX_SEQENCE_LENGTH-1), # examples are 1 character shorter because the last character is used as the target
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

**Conclusions:**

- Models with an embedding dimension < 50 and < 32 LSTM units underfit (high bias): they can't learn the patterns in the data well enough.
- `LayerNormalization()` speeds up training.
- More than one LSTM layer doesn't improve performance.
- An embedding dimension above 500 doesn't improve performance.

In [None]:
model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), metrics=['accuracy']) # `lr` is a deprecated alias of `learning_rate`

In [None]:
model.fit(X_train, y_train, epochs=100, batch_size=512, verbose=0)

# Generate your own hero
***
I know we are on a mission to help R2, but let's quickly check whether our model has learned anything. Otherwise it won't be able to serve the mission.

In [None]:
def generate_new_character(name_start, name_len):
    """Appends `name_len` sampled characters to `name_start` and returns the new name."""
    # Left-pad the starting letters with zeros, the same way the training data was padded
    sequence = [0] * (MAX_SEQENCE_LENGTH-len(name_start)) + [characters.index(c) for c in name_start]
    new_hero = name_start

    for _ in tqdm(range(name_len)):
        # `sequence` keeps growing, so take its last MAX_SEQENCE_LENGTH-1 tokens and
        # reshape them to (1, MAX_SEQENCE_LENGTH-1): a batch that holds a single example
        probabilities = model.predict(
            np.reshape(sequence[-MAX_SEQENCE_LENGTH+1:], (1, MAX_SEQENCE_LENGTH-1)),
            verbose=0
        ).ravel().astype('float64')
        probabilities /= probabilities.sum() # float32 softmax output may not sum to exactly 1, which np.random.choice rejects
        # Sample an index from the softmax distribution (instead of taking the argmax) so that the output isn't the same every time
        prediction = np.random.choice(vocab_size, p=probabilities)

        new_hero += characters[prediction]
        sequence.append(prediction)

    return new_hero

In [None]:
# Re-run this cell to generate different characters
name_start = 'Qui' # Starting letters of your character; at least one letter is recommended, but it can be empty
name_len = 12      # Number of characters to generate after the starting letters

generate_new_character(name_start, name_len)
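
The reason re-running the cell gives different names is that each next character is sampled from the softmax output rather than chosen greedily. A small standard-library sketch of the difference, using a made-up three-class distribution (`toy_probabilities`) and hypothetical helper names:

```python
import random

def sample_index(probabilities, rng=random):
    """Draw one index at random, weighted by `probabilities`."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def argmax_index(probabilities):
    """Return the index of the largest probability (greedy choice)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

toy_probabilities = [0.1, 0.6, 0.3]  # made-up softmax output
rng = random.Random(0)               # seeded only to make the sketch repeatable
samples = [sample_index(toy_probabilities, rng) for _ in range(1000)]

print(argmax_index(toy_probabilities))               # greedy: always the same index
print({i: samples.count(i) for i in sorted(set(samples))})  # sampling: a mix of indices
```

Greedy decoding would produce the same name for the same starting letters every time, while sampling still favours likely characters but lets less likely ones through.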

Here are some new characters our model created: ***Darth Nihilus***, ***Darth Sinta***, ***Grexderge***, ***Admiral Dabax***, ***Quinfah Vosaa***, ***Wilajenen Firgus***, ***Qu Rahn Din***

The longer you experiment, the more interesting names show up. Using *Darth*, *Adm*, or *Qu* as starting letters gives interesting results!

# Create a new friend
***
Finally, we can fulfill our mission! We will now fine-tune our model to create names similar to those of Star Wars droids. So far the model has been trained only on names from the main canon and legends, but we also have a small dataset of droid names!

In [None]:
droids = pd.read_csv('/kaggle/input/star-wars-characters/droid_characters.csv')

In [None]:
droids.head()
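
Since the vocabulary was built from canon and legends names only, a droid name could contain a character the model has never seen, and `list.index` raises a `ValueError` for a value that is not in the list. A quick check along these lines may be worth running before tokenizing. This is a sketch on toy data, and `missing_characters` is a hypothetical helper.

```python
def missing_characters(names, vocabulary):
    """Return the sorted characters that occur in `names` but not in `vocabulary`."""
    known = set(vocabulary)
    return sorted({ch for name in names for ch in name if ch not in known})

toy_vocabulary = list("R2-D ChewbacC3PO")  # toy stand-in for `characters`
toy_droid_names = ["R2-D2", "BB-8"]        # toy stand-in for droids['name']

print(missing_characters(toy_droid_names, toy_vocabulary))
```

With the real data, the call would be `missing_characters(droids['name'], characters)`; an empty result means every droid name can be tokenized.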

In [None]:
# Tokenize
sequences = []
for sentence in droids['name'].values:
    sequence = []
    for character in sentence:
        sequence.append(characters.index(character))
    sequences.append(sequence)
    
# Create N-Grams
n_gram_sequences = []
for sequence in sequences:
    for i in range(len(sequence)-1):
        n_gram_sequences.append(sequence[:i+2])

# Padding
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(n_gram_sequences, maxlen=MAX_SEQENCE_LENGTH, padding='pre')

In [None]:
padded_sequences

In [None]:
X_train = padded_sequences[:, :-1]
y_train = tf.keras.utils.to_categorical(padded_sequences[:, -1], num_classes=vocab_size)

In [None]:
model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), metrics=['accuracy']) # `lr` is a deprecated alias of `learning_rate`

In [None]:
model.fit(X_train, y_train, epochs=50, batch_size=512, verbose=0)

In [None]:
name_start = 'C'
name_len = 8

generate_new_character(name_start, name_len)

# Success!
***
Not only did we create a new friend for R2-D2, but a whole group! *R5-75*, *A999 Crod*, *A-99L*, *3-P3* and probably his future best friend, *R2-D4*! R2 will not feel lonely anymore. We should be proud of ourselves!

As Master Yoda once said, “Do or do not. There is no try.” We have done it, and we have done it well. May the Force be with us!