# Model 1_1

## Programmer: Giovanni Vecchione
## Date: 4/24/24

## Subject: Machine Learning 2 - Capstone Project
* ### Dungeons and Dragons Narrative Model


## Structure:
* ### *Focusing on the NLP model for the Capstone Project due to time constraints.*
* ### NLP component that can interpret spoken D&D commands and classify them into game-specific intents.

## Status: In-Progress

## Hypothesis:
* ### By using a combination of text normalization, feature engineering, and appropriate classification models, I can achieve an accuracy of at least 80% on identifying D&D intents from spoken commands.

## Data Collection:
* ### Define a focused set of D&D commands and generate some commands using spoken word w/ variation.

## Preprocessing:
* ### Speech-to-text - Need text transcripts.
* ### Text Normalization - Clean up the transcripts by removing filler words.
* ### Tokenize the data, build a vocab, and encode the tokens.
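
The normalization step could look roughly like the sketch below. The filler-word list and the function name `normalize_transcript` are assumptions made for illustration; they are not taken from the project and would need tuning against real transcripts.

```python
import re

# Hypothetical filler list (an assumption, not from the project data).
FILLER_WORDS = {"um", "uh", "er", "hmm"}

def normalize_transcript(text, fillers=FILLER_WORDS):
    """Lowercase the text, replace punctuation (except apostrophes) with spaces,
    and drop filler words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    words = [word for word in text.split() if word not in fillers]
    return " ".join(words)

print(normalize_transcript("Um, I attack the, uh, goblin!"))
```

Stripping punctuation here is a design choice for spoken input; the spaCy tokenizer used later keeps punctuation as separate tokens, so this step is optional.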

## Feature Engineering:
* ### Bag-of-Words to start and then add word embeddings and/or specialized features.
* ### UPDATE: This was removed because the RNN needs sequential data, and bag-of-words discards word order.
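
A small illustration of why bag-of-words does not suit a sequence model: two commands with opposite meanings produce identical word counts. This is a sketch using only the standard library, and the helper name `bag_of_words` is made up for the example.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences, ignoring word order."""
    return Counter(sentence.lower().split())

first = bag_of_words("the goblin attacks the wizard")
second = bag_of_words("the wizard attacks the goblin")

# Same counts, even though who attacks whom is reversed.
print(first == second)
```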

## Vocabulary:
* ### The vocabulary is essentially a list (or an index) of all the unique words the model has encountered in the training data. Each unique word is assigned a unique numerical index. This allows the model to work with numbers instead of text, as neural networks are better at processing numerical representations.


## Model Selection:
* ### Neural Network - RNN (LSTM)
* ### Pytorch
## Evaluation:



In [1]:
import matplotlib as mtp
import torch
import numpy as np
import random
import tensorflow as tf
from tensorflow import keras 
import matplotlib.pyplot as plt

In [2]:
#Checks if GPU is being used
if torch.cuda.is_available():
    device = torch.device("cuda")  # Use the GPU
    print("Using GPU:", torch.cuda.get_device_name(0)) 
else:
    device = torch.device("cpu")  # Fallback to CPU
    print("GPU not available, using CPU.")

#Using GPU: NVIDIA GeForce GTX 1660 SUPER - Successful
#NOTE: Setup took some time: installing the CUDA toolkit v12.4, adding it to the PATH, and installing the right supplemental packages.
#This drastically improved training time.

Using GPU: NVIDIA GeForce GTX 1660 SUPER


In [3]:
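seed = 42
random.seed(seed)
np.random.seed(seed)     # Seed NumPy as well
torch.manual_seed(seed)  # Seed PyTorch (weight initialization, DataLoader shuffling)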

## Dataset: Cornell Movie-Dialogs Corpus
* Pre-train on general text datasets.
* Introduce specific DND texts.
* Fine tune for intents using DND terms.

Here's the challenge: I need to transform this conversational data into a format where I have "command-like" examples to train the intent classifier.

In [4]:
#NOTE: This is a corpus of movie-script dialog; it serves as a large general-text dataset to supplement the training data.
from convokit import Corpus, download
corpus = Corpus(filename=download("movie-corpus"))

Downloading movie-corpus to C:\Users\GioDude\.convokit\downloads\movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done


# Dataset: DND Critical Role Transcripts

In [5]:
#NOTE: This is a dnd transcript
import codecs  # handles various file encodings

# Raw string so the backslashes in the Windows path are not treated as escape sequences
with codecs.open(r"D:\GioDude\Documents\ACC\Spring 2024\Machine Learning II\Datasets\DnD Scripts\cr_dnd_transcripts_1.txt", 'r', encoding='utf-8') as file:
    dnd_data = file.read()

In [6]:
dnd_utterances = dnd_data.splitlines()  # Split into lines (assumes one utterance per line)

In [7]:
all_utterance_ids = corpus.get_utterance_ids()  # Fetch the ID list once, outside the loop
for i in range(5):
    random_id = random.choice(all_utterance_ids)
    utterance = corpus.get_utterance(random_id)
    print(utterance.text)

# Prints five randomly chosen utterances as a sanity check

Very good.
This is a nice place.  It must have cost a pretty penny.
Sure, Mickey.  Sure.
Yes, sir.  They are indeed.
Let's go by Rosarita's. You been there yet?


# spaCy for Tokenization

In [8]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [9]:
def tokenize_sentence(sentence):
    doc = nlp(sentence)
    return [token.text for token in doc]

# ADDING LABELS

In [10]:
#An empty list to store our training data (command, intent) pairs
all_data = []

# Example - D&D Intents
my_intents = ["attack", "move", "jump", "hide", "talk", "cast"]

#Manually Created Examples
all_data.append(("attack the goblin with my sword!", "attack"))
all_data.append(("move north", "move"))
all_data.append(("leap over the chasm!", "jump"))
all_data.append(("sneak behind the enemy", "hide"))
all_data.append(("can I persuade the shopkeeper?", "talk"))
all_data.append(("cast fireball!", "cast"))

def label_intent(text):
    """Simple keyword-based intent labeling (replace with more selective logic later)."""
    # NOTE: these are substring checks, so "go" also matches words such as "good" or "goblin".
    if "attack" in text:
        return "attack"
    elif "move" in text or "go" in text:
        return "move"
    elif "jump" in text:
        return "jump"
    elif "hide" in text:  # "go" is already claimed by the "move" branch above
        return "hide"
    elif "talk" in text:
        return "talk"
    elif "cast" in text:
        return "cast"
    return "filler"  # No intent was found

#Selective modification of movie dialog to find and label intents
for utterance_id in corpus.get_utterance_ids():
    text = corpus.get_utterance(utterance_id).text.lower()
    all_data.append((text, label_intent(text)))

#Selective modification of DND Transcripts to find and label intents
for utterance in dnd_utterances:
    text = utterance.lower()
    all_data.append((text, label_intent(text)))

#NOTE : This can be modified to include other words of intent.
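
One possible refinement of the keyword labeling, sketched under assumptions: matching on whole words with a regular expression, so that "go" no longer fires on "good" or "goblin". The names `INTENT_KEYWORDS` and `label_by_whole_word` are hypothetical, and the keyword table simply mirrors the keywords already used in this notebook.

```python
import re

# Hypothetical keyword table; extend each list with other words of intent.
INTENT_KEYWORDS = {
    "attack": ["attack"],
    "move": ["move", "go"],
    "jump": ["jump"],
    "hide": ["hide"],
    "talk": ["talk"],
    "cast": ["cast"],
}

# One compiled pattern per intent; \b restricts matches to whole words.
INTENT_PATTERNS = [
    (intent, re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"))
    for intent, words in INTENT_KEYWORDS.items()
]

def label_by_whole_word(text):
    """Return the first intent whose keyword appears as a whole word, else 'filler'."""
    text = text.lower()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return "filler"

print(label_by_whole_word("That is a good goblin"))
print(label_by_whole_word("Go north"))
```

Whole-word matching also skips inflected forms such as "attacks" or "going", so the keyword lists would need those variants added explicitly (or the text lemmatized first).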

In [11]:
## SIDE NOTE: This was for a separate model; since RNNs rely on sequential data, this was removed
"""
from nltk import word_tokenize 
from sklearn.feature_extraction.text import CountVectorizer
#NOTE: This represents each command as a "bag" of its constituent words, disregarding grammar and word order. The focus is on word frequency.

#### Assuming you have your training data in 'all_data' like the previous examples
sentences = [datapoint[0] for datapoint in all_data]  # Extract just the text sentences
intents = [datapoint[1] for datapoint in all_data]  # Extract the intents (labels)

vectorizer = CountVectorizer() 
X = vectorizer.fit_transform(sentences)  

"""


#### 'X' now contains a matrix where each row is a command and columns represent different words
#### The values in the matrix represent the word counts


# BUILD VOCABULARY

In [12]:
def build_vocabulary(tokenized_data):
    vocabulary = {}  # empty dictionary
    index = 0  # assign indices to each unique token

    for tokens, _ in tokenized_data:  # Ignore intent for now 
        for token in tokens:
            if token not in vocabulary:  # Check if token is new
                vocabulary[token] = index
                index += 1

    return vocabulary

In [13]:
def build_intent_vocabulary(labels): 
    intent_vocab = {}
    index = 0
    for label in labels:
        if label not in intent_vocab:
            intent_vocab[label] = index
            index += 1
    return intent_vocab

# TOKENIZE & ENCODE VOCAB for text and intent(labels)

"Vocab" here means the token-to-index mapping built from the dataset once it has been tokenized; encoding uses that mapping to turn tokens into numbers.
Recap:
1. Import data
2. Clean / Preprocess
3. Tokenize
4. Encode

In [14]:
# Tokenize data and build vocabulary
tokenized_data = []
for text, intent in all_data:
    tokens = tokenize_sentence(text)
    tokenized_data.append((tokens, intent)) 

vocabulary = build_vocabulary(tokenized_data) 

#The entire data set is tokenized, i.e. broken down into meaningful units called tokens:
#words, punctuation, or sometimes even subword units.

In [15]:
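# Build intent vocabulary AFTER the loop
# NOTE: a set of strings has no stable iteration order across Python sessions, so the
# label-to-index mapping can differ between runs; passing sorted(unique_labels) would make it reproducible.
unique_labels = set(intent for _, intent in all_data)
intent_vocabulary = build_intent_vocabulary(unique_labels)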

In [16]:
vocabulary["<UNK>"] = len(vocabulary)  # Give unseen words their own index; 0 already belongs to the first token

In [17]:
def encode_sequence(tokens, vocabulary, max_length=None):
    encoded_sequence = [vocabulary.get(token, vocabulary["<UNK>"]) for token in tokens]  

    # (Optional) Padding for equal length 
    if max_length:
        encoded_sequence = encoded_sequence[:max_length]  # Truncate if too long 
        encoded_sequence += [0] * (max_length - len(encoded_sequence))  # Pad with zeros

    return encoded_sequence
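
Note that the padding value 0 is also the index of the first token added to the vocabulary, so the model cannot tell padding from that token. A common remedy, sketched here as an assumption rather than as this project's method, is to reserve dedicated indices for padding and unknown words before adding real tokens. The function names and the `<PAD>` token are illustrative only.

```python
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"

def build_vocabulary_with_specials(tokenized_data):
    """Reserve index 0 for padding and 1 for unknown words, then add real tokens."""
    vocabulary = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for tokens, _ in tokenized_data:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def encode_with_padding(tokens, vocabulary, max_length):
    """Map tokens to indices, truncate to max_length, and pad with the PAD index."""
    ids = [vocabulary.get(token, vocabulary[UNK_TOKEN]) for token in tokens]
    ids = ids[:max_length]
    return ids + [vocabulary[PAD_TOKEN]] * (max_length - len(ids))

toy_data = [(["cast", "fireball", "!"], "cast"), (["move", "north"], "move")]
toy_vocab = build_vocabulary_with_specials(toy_data)

# "lightning" is not in the toy vocabulary, so it maps to the <UNK> index.
print(encode_with_padding(["cast", "lightning", "!"], toy_vocab, max_length=5))
```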

In [18]:
def encode_intent(intent, intent_vocabulary):
    return intent_vocabulary[intent] 

In [19]:
# Encoding vocabularies 
encoded_data = []
for tokens, intent in tokenized_data:
    encoded_sequence = encode_sequence(tokens, vocabulary)
    encoded_intent = encode_intent(intent, intent_vocabulary) 
    encoded_data.append((encoded_sequence, encoded_intent))

In [20]:
print(encoded_data[:5])  # Print the first 5 elements 

[([0, 1, 2, 3, 4, 5, 6], 4), ([7, 8], 6), ([9, 10, 1, 11, 6], 1), ([12, 13, 1, 14], 2), ([15, 16, 17, 1, 18, 19], 0)]


# SPLITTING THE DATA

In [21]:
from sklearn.model_selection import train_test_split

# Assuming encoded_data is your list of (encoded_command, encoded_intent) tuples
training_data, validation_and_test_data = train_test_split(encoded_data, test_size=0.3, random_state=42)

# Further split the held-out 30% into validation and test sets (about 67% for validation, 33% for test)
validation_data, test_data = train_test_split(validation_and_test_data, test_size=0.33, random_state=42) 


In [22]:
#NOTE: Extracting Labels Later (If Needed)
training_commands = [data[0] for data in training_data]
training_labels = [data[1] for data in training_data]

In [23]:
print("Training Data Length:", len(training_data))
print("Validation Data Length:", len(validation_data))
print("Test Data Length:", len(test_data))

Training Data Length: 220721
Validation Data Length: 63379
Test Data Length: 31217
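
The three lengths above can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up and gives the remainder to the other split, which matches the printed numbers; 315317 is simply the sum of the three lengths. The helper name `split_sizes` is made up for this check.

```python
import math

def split_sizes(n_samples, holdout_fraction=0.3, test_fraction_of_holdout=0.33):
    """Reproduce the two-stage split sizes: train, then validation/test from the holdout."""
    n_holdout = math.ceil(holdout_fraction * n_samples)
    n_train = n_samples - n_holdout
    n_test = math.ceil(test_fraction_of_holdout * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(split_sizes(315317))
```

In overall terms this is roughly a 70% / 20% / 10% split for training, validation, and test.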


# CRAFT A PyTorch DATASET to streamline RNN's data handling.

### *Inheritance:* The DNDDataset class inherits from PyTorch's  Dataset class, providing a standard interface for loading data.

### *__init__(self, data) :*  The constructor initializes the dataset with the data  (e.g., one of the splits like training_data).

### *__len__(self):* This method returns the length (number of samples) in the dataset.

### *__getitem__(self, index):*  Crucial for data loading. It returns a single sample:

        * Fetches the command and intent based on index.
        * Converts the command (list of numbers) and intent (single number) into PyTorch tensors.

In [45]:
#Padding helper function
def pad_sequence(sequence, max_length, pad_value=0):
    padded_sequence = sequence[:max_length] + [pad_value] * (max_length - len(sequence)) 
    return padded_sequence

In [46]:
from torch.utils.data import Dataset

class DNDDataset(Dataset):
    def __init__(self, data, max_length=30):
        self.data = data
        # NOTE: Adjust max_length toward the maximum command length in your dataset
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        command, intent = self.data[index]

        # Pad (or truncate) every command to the same length so batches can be stacked
        command = pad_sequence(command, self.max_length)

        # Token indices and the class label are both integer (long) tensors
        command_tensor = torch.tensor(command, dtype=torch.long)
        intent_tensor = torch.tensor(intent, dtype=torch.long)
        return command_tensor, intent_tensor

In [47]:
train_dataset = DNDDataset(training_data)
val_dataset = DNDDataset(validation_data)
test_dataset = DNDDataset(test_data)

In [48]:
# NOTE: Using a DataLoader (Optional, but recommended)
from torch.utils.data import DataLoader

batch_size = 32  # Adjust as needed

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)  # Usually no shuffling for validation
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

In [49]:
for batch in train_dataloader:
    commands, intents = batch
    print(commands.shape, intents.shape)  # Verify tensor shapes
    break  # One batch is enough to check the shapes

torch.Size([32, 30]) torch.Size([32])

# CREATING THE MODEL

In [50]:
import torch.nn as nn

class BasicLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, num_layers=1):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)  # Maps token indices to dense vectors
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, output_dim)  # Hidden state -> one logit per intent

    def forward(self, text):
        embedded = self.embedding(text)  # (batch, seq_len) -> (batch, seq_len, embedding_dim)
        output, (hidden, cell) = self.lstm(embedded)
        preds = self.linear(hidden[-1])  # Final hidden state of the last LSTM layer (read after any padding positions)
        return preds

In [52]:
vocab_size = len(vocabulary)
output_dim = len(intent_vocabulary)  # 7 classes: six intents plus "filler"
embedding_dim = 100
hidden_dim = 128
num_layers = 4

model1_1 = BasicLSTM(vocab_size, embedding_dim, hidden_dim, output_dim, num_layers)

In [53]:
print(vocab_size)

56410


In [54]:
criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model1_1.parameters())  

In [56]:
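#NOTE: Training Loop

num_epochs = 5  # adjust if needed

model1_1.to(device)  # Run on the GPU when one was detected above
model1_1.train()

for epoch in range(num_epochs):
    batch_count = 0  # Initialize counter

    for batch in train_dataloader:
        commands, intents = batch
        # Keep the batch on the same device as the model
        commands, intents = commands.to(device), intents.to(device)

        optimizer.zero_grad()                    # Clear gradients from the previous step
        predictions = model1_1(commands)         # Forward pass
        loss = criterion(predictions, intents)   # Cross-entropy against the intent labels
        loss.backward()                          # Backpropagate
        optimizer.step()                         # Update the weights

        batch_count += 1  # Increment batch counter

        # Print training statistics every 100 batches
        if batch_count % 100 == 0:
            print(f'Epoch {epoch+1}, Batch {batch_count}, Loss: {loss.item():.4f}')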


KeyboardInterrupt: 
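
The training loop above never measures how the LSTM does on held-out data. A minimal sketch of an evaluation pass over a `DataLoader` follows; `evaluate_accuracy` is a hypothetical helper, and the toy model and data exist only to show the call. With the notebook's objects the call would be `evaluate_accuracy(model1_1, val_dataloader, device)`, assuming the model has been moved to `device`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate_accuracy(model, dataloader, device):
    """Fraction of samples whose highest-scoring class matches the label."""
    model.eval()  # Disable training-only behaviour such as dropout
    correct, total = 0, 0
    with torch.no_grad():  # No gradients are needed for evaluation
        for commands, intents in dataloader:
            commands, intents = commands.to(device), intents.to(device)
            predicted = model(commands).argmax(dim=1)
            correct += (predicted == intents).sum().item()
            total += intents.size(0)
    return correct / total if total else 0.0

# Toy stand-in model (illustration only): "predicts" the first token as the class.
class FirstTokenModel(nn.Module):
    def forward(self, x):
        return nn.functional.one_hot(x[:, 0], num_classes=3).float()

toy_commands = torch.tensor([[0, 1], [1, 2], [2, 0], [1, 1]])
toy_intents = torch.tensor([0, 1, 2, 0])
toy_loader = DataLoader(TensorDataset(toy_commands, toy_intents), batch_size=2)

toy_accuracy = evaluate_accuracy(FirstTokenModel(), toy_loader, torch.device("cpu"))
print(toy_accuracy)
```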

In [None]:
from sklearn.metrics import accuracy_score 

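# NOTE: clf, X_test, and y_test are not defined anywhere in this notebook, so this cell does not evaluate model1_1.
# The output below presumably comes from the separate classifier mentioned in the side note above.
y_pred = clf.predict(X_test)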
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy) 

Accuracy: 0.9739946720791577


In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[    3     0    58     0     0    29     0]
 [    0     0    23     0     0    29     1]
 [   11     6 51787     0     3   440   143]
 [    0     0     8     0     0    28     0]
 [    0     0     8     0     0    32     0]
 [    2     0   190     0     0  9372     8]
 [    0     0    44     0     0   577   262]]




              precision    recall  f1-score   support

      attack       0.19      0.03      0.06        90
        cast       0.00      0.00      0.00        53
      filler       0.99      0.99      0.99     52390
        hide       0.00      0.00      0.00        36
        jump       0.00      0.00      0.00        40
        move       0.89      0.98      0.93      9572
        talk       0.63      0.30      0.40       883

    accuracy                           0.97     63064
   macro avg       0.39      0.33      0.34     63064
weighted avg       0.97      0.97      0.97     63064
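
The gap between 0.97 accuracy and 0.33 macro-average recall can be reproduced directly from the confusion matrix printed above. The sketch below copies that matrix (rows are true classes, columns are predictions, in the label order of the report) and recomputes both numbers; the function names are illustrative.

```python
LABELS = ["attack", "cast", "filler", "hide", "jump", "move", "talk"]

# Copied from the confusion matrix output above.
CONFUSION = [
    [3, 0, 58, 0, 0, 29, 0],
    [0, 0, 23, 0, 0, 29, 1],
    [11, 6, 51787, 0, 3, 440, 143],
    [0, 0, 8, 0, 0, 28, 0],
    [0, 0, 8, 0, 0, 32, 0],
    [2, 0, 190, 0, 0, 9372, 8],
    [0, 0, 44, 0, 0, 577, 262],
]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_recall(matrix):
    """For each true class, the share of its samples that were predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

recalls = per_class_recall(CONFUSION)
macro_recall = sum(recalls) / len(recalls)

print(f"accuracy: {accuracy(CONFUSION):.4f}")
print(f"macro recall: {macro_recall:.2f}")
for label, recall in zip(LABELS, recalls):
    print(f"{label:>7}: {recall:.2f}")
```

Accuracy is dominated by the two large classes (`filler` and `move`), while `cast`, `hide`, and `jump` are never predicted correctly, which is why the macro average is so much lower.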



