# Model 1_1

## Programmer: Giovanni Vecchione
## Date: 4/24/24

## Subject: Machine Learning 2 - Capstone Project
* ### Dungeons and Dragons Narrative Model


## Structure:
* ### *Focusing on the NLP model for the Capstone Project due to time constraints.*
* ### NLP component that can interpret spoken D&D commands and classify them into game-specific intents.

## Status: In-Progress

## Hypotheses:
* ### By using a combination of text normalization, feature engineering, and appropriate classification models, I can achieve an accuracy of at least 80% on identifying D&D intents from spoken commands.

## Data Collection:
* ### Define a focused set of D&D commands and generate spoken examples of each command with varied phrasing.

## Preprocessing:
* ### Speech-to-text - Need text transcripts.
* ### Text Normalization - Clean up the transcripts by removing filler words.
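
The text-normalization step above could look roughly like the sketch below. The filler-word list and the `normalize_transcript` name are illustrative assumptions, not part of the project code; a real list would be tuned on actual transcripts.

```python
import re

# Hypothetical filler-word list (assumption for illustration only).
FILLER_WORDS = {"um", "uh", "er", "erm", "hmm"}

def normalize_transcript(text):
    """Lowercase a transcript, strip punctuation, and drop filler words."""
    # Keep letters, digits, apostrophes and whitespace; replace everything else with a space.
    cleaned = re.sub(r"[^a-z0-9'\s]", " ", text.lower())
    tokens = [tok for tok in cleaned.split() if tok not in FILLER_WORDS]
    return " ".join(tokens)

print(normalize_transcript("Um, I want to, uh, attack the goblin!"))
# i want to attack the goblin
```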

## Feature Engineering:
* ### Bag-of-Words to start and then add word embeddings and/or specialized features.

## Model Selection:
* ### Neural Network - RNN
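
A minimal sketch of what such a recurrent intent classifier might look like in PyTorch. The class name `IntentRNN`, the layer sizes, and the choice of a GRU as the recurrent layer are assumptions for illustration (`nn.RNN` would be a drop-in replacement); the seven output classes correspond to the six intents plus `filler` used later in the notebook.

```python
import torch
import torch.nn as nn

class IntentRNN(nn.Module):
    """Embedding -> GRU -> linear layer over the final hidden state."""

    def __init__(self, vocab_size, num_intents, embed_dim=64, hidden_dim=128, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_intents)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tensor of integer token indices
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, last_hidden = self.rnn(embedded)       # (1, batch, hidden_dim)
        return self.classifier(last_hidden[-1])   # (batch, num_intents) logits

# Toy sizes, chosen only to show the tensor shapes.
model = IntentRNN(vocab_size=100, num_intents=7)
dummy_batch = torch.randint(0, 100, (4, 10))  # 4 fake commands, 10 tokens each
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 7])
```

The logits would typically be passed to `nn.CrossEntropyLoss` together with the encoded intent labels.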

## Evaluation:



In [5]:
import matplotlib as mtp
import torch
import numpy as np
import random
import tensorflow as tf
from tensorflow import keras 
import matplotlib.pyplot as plt

In [6]:
#Checks if GPU is being used
if torch.cuda.is_available():
    device = torch.device("cuda")  # Use the GPU
    print("Using GPU:", torch.cuda.get_device_name(0)) 
else:
    device = torch.device("cpu")  # Fallback to CPU
    print("GPU not available, using CPU.")

#Using GPU: NVIDIA GeForce GTX 1660 SUPER - Successful
#NOTE: This took some time to set up by installing and pathing the cuda toolkit v.12.4 and the right supplemental packages. This drastically improved
#training time

Using GPU: NVIDIA GeForce GTX 1660 SUPER


In [7]:
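seed = 42
random.seed(seed)
np.random.seed(seed)     # Also seed NumPy
torch.manual_seed(seed)  # and PyTorch, so later model initialization is reproducible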

## Dataset: Cornell Movie-Dialogs Corpus
* Pre-train on general text datasets.
* Introduce specific DND texts.
* Fine tune for intents using DND terms.

Here's the challenge: I need to transform this conversational data into a format where I have "command-like" examples to train the intent classifier.

In [8]:
#NOTE: This is a corpus of movie-script dialogue; it is used only as a large general-purpose dataset to supplement the training data.
from convokit import Corpus, download
corpus = Corpus(filename=download("movie-corpus"))

Downloading movie-corpus to C:\Users\GioDude\.convokit\downloads\movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done


# Dataset: DND Critical Role Transcripts

In [9]:
#NOTE: This is a dnd transcript
import codecs  # handles various file encodings

# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
with codecs.open(r"D:\GioDude\Documents\ACC\Spring 2024\Machine Learning II\Datasets\DnD Scripts\cr_dnd_transcripts_1.txt", 'r', encoding='utf-8') as file:
    dnd_data = file.read()

In [10]:
dnd_utterances = dnd_data.splitlines()  # Split into lines (assumes one utterance per line)

In [11]:
all_utterance_ids = corpus.get_utterance_ids()  # Fetch the ID list once, not on every iteration
for i in range(5):
    random_id = random.choice(all_utterance_ids)
    utterance = corpus.get_utterance(random_id)
    print(utterance.text)

# Prints five randomly chosen utterances from the movie corpus as examples

Very good.
This is a nice place.  It must have cost a pretty penny.
Sure, Mickey.  Sure.
Yes, sir.  They are indeed.
Let's go by Rosarita's. You been there yet?


# spaCy for Tokenization

In [18]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [19]:
def tokenize_sentence(sentence):
    doc = nlp(sentence)
    return [token.text for token in doc]

# ADDING LABELS

In [20]:
#An empty list to store our training data (command, intent) pairs
all_data = []

# Example - D&D Intents
my_intents = ["attack", "move", "jump", "hide", "talk", "cast"]

#Manually Created Examples
all_data.append(("attack the goblin with my sword!", "attack"))
all_data.append(("move north", "move"))
all_data.append(("leap over the chasm!", "jump"))
all_data.append(("sneak behind the enemy", "hide"))
all_data.append(("can I persuade the shopkeeper?", "talk"))
all_data.append(("cast fireball!", "cast"))

#Selective modification of movie dialog to find and label intents
for utterance in corpus.get_utterance_ids(): 
    text = corpus.get_utterance(utterance).text.lower()

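    # NOTE: `in` does substring matching, so "go" also matches "good", "going", "ago", etc.
    if "attack" in text:
        all_data.append((text, "attack"))
    elif "move" in text or "go" in text:
        all_data.append((text, "move"))
    elif "jump" in text:
        all_data.append((text, "jump"))
    elif "hide" in text:  # The original `or "go" in text` here was unreachable: "go" is already caught by "move"
        all_data.append((text, "hide"))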
    elif "talk" in text:
        all_data.append((text, "talk"))
    elif "cast" in text:
        all_data.append((text, "cast"))
    else: #if no intent was found
        all_data.append((text, "filler"))

#Selective modification of DND Transcripts to find and label intents
for utterance in dnd_utterances:
    text = utterance.lower() 

    # Simple intent labeling (replace with more selective logic later)
    if "attack" in text:
        all_data.append((text, "attack"))
    elif "move" in text or "go" in text: 
        all_data.append((text, "move"))
    elif "jump" in text:
        all_data.append((text, "jump"))
    elif "hide" in text:  # The original `or "go" in text` here was unreachable: "go" is already caught by "move"
        all_data.append((text, "hide"))
    elif "talk" in text:
        all_data.append((text, "talk"))
    elif "cast" in text:
        all_data.append((text, "cast"))
    else: #if no intent was found
        all_data.append((text, "filler"))

#NOTE : This can be modified to include other words of intent.
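
Because the labeling above uses substring checks, a word such as "good" is labeled `move` via "go". One possible refinement is a keyword table matched on whole words with regular expressions, sketched below. The `INTENT_KEYWORDS` table, the extra synonyms in it, and the `label_intent` name are assumptions for illustration, and switching to this rule would change the label counts reported later in the notebook.

```python
import re

# Hypothetical keyword table; order matters because the first match wins.
INTENT_KEYWORDS = {
    "attack": ["attack"],
    "move": ["move", "go"],
    "jump": ["jump", "leap"],
    "hide": ["hide", "sneak"],
    "talk": ["talk", "persuade"],
    "cast": ["cast"],
}

# Pre-compile one whole-word pattern per intent, e.g. r"\b(?:move|go)\b".
INTENT_PATTERNS = {
    intent: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    for intent, words in INTENT_KEYWORDS.items()
}

def label_intent(text):
    """Return the first intent whose keyword appears as a whole word, else 'filler'."""
    lowered = text.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(lowered):
            return intent
    return "filler"

print(label_intent("Let's go north"))       # move
print(label_intent("That is a good idea"))  # filler
```

Whole-word matching is stricter than the original rule: inflected forms such as "attacks" no longer match unless they are added to the table.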

In [None]:
## SIDE NOTE: This was for a separate model; since RNNs rely on sequential data, this was removed
"""
from nltk import word_tokenize 
from sklearn.feature_extraction.text import CountVectorizer
#NOTE: This represents each command as a "bag" of its constituent words, disregarding grammar and word order. The focus is on word frequency.

#### Assuming you have your training data in 'all_data' like the previous examples
sentences = [datapoint[0] for datapoint in all_data]  # Extract just the text sentences
intents = [datapoint[1] for datapoint in all_data]  # Extract the intents (labels)

vectorizer = CountVectorizer() 
X = vectorizer.fit_transform(sentences)  

"""


#### 'X' now contains a matrix where each row is a command and columns represent different words
#### The values in the matrix represent the word counts

# BUILD VOCABULARY

In [21]:
def build_vocabulary(tokenized_data):
    vocabulary = {}  # empty dictionary
    index = 0  # assign indices to each unique token

    for tokens, _ in tokenized_data:  # Ignore intent for now 
        for token in tokens:
            if token not in vocabulary:  # Check if token is new
                vocabulary[token] = index
                index += 1

    return vocabulary

In [29]:
def build_intent_vocabulary(labels): 
    intent_vocab = {}
    index = 0
    for label in labels:
        if label not in intent_vocab:
            intent_vocab[label] = index
            index += 1
    return intent_vocab

# TOKENIZE & ENCODE VOCAB for text and intent(labels)

In [22]:
# Tokenize and build vocabulary
tokenized_data = []
for text, intent in all_data:
    tokens = tokenize_sentence(text)
    tokenized_data.append((tokens, intent)) 

vocabulary = build_vocabulary(tokenized_data) 

In [30]:
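# Build the intent vocabulary after all_data is complete.
# NOTE: a set of strings has no stable order across Python sessions, so the
# intent indices can change between runs; wrapping it in sorted() would make them reproducible.
unique_labels = set(intent for _, intent in all_data)
intent_vocabulary = build_intent_vocabulary(unique_labels)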

In [31]:
vocabulary.setdefault("<UNK>", len(vocabulary))  # Give unseen words their own index; index 0 already belongs to the first token

In [32]:
def encode_sequence(tokens, vocabulary, max_length=None):
    encoded_sequence = [vocabulary.get(token, vocabulary["<UNK>"]) for token in tokens]  

    # (Optional) Padding for equal length 
    if max_length:
        encoded_sequence = encoded_sequence[:max_length]  # Truncate if too long 
        encoded_sequence += [0] * (max_length - len(encoded_sequence))  # Pad with zeros

    return encoded_sequence
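
In `encode_sequence`, padding uses 0, but `build_vocabulary` also gives index 0 to the first token it sees, so padded positions are indistinguishable from that token. A common way around this is to reserve fixed indices for the special tokens before adding real words. The sketch below assumes hypothetical helper names (`build_vocabulary_with_specials`, `encode_with_padding`) and is not the notebook's current code.

```python
PAD_TOKEN, UNK_TOKEN = "<PAD>", "<UNK>"

def build_vocabulary_with_specials(tokenized_data):
    """Map tokens to indices, reserving 0 for padding and 1 for unknown words."""
    vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for tokens, _ in tokenized_data:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode_with_padding(tokens, vocab, max_length):
    """Encode tokens, truncating or right-padding to exactly max_length ids."""
    ids = [vocab.get(tok, vocab[UNK_TOKEN]) for tok in tokens][:max_length]
    return ids + [vocab[PAD_TOKEN]] * (max_length - len(ids))

# Toy data in the same (tokens, intent) shape as `tokenized_data`.
toy_data = [(["cast", "fireball", "!"], "cast"), (["move", "north"], "move")]
vocab = build_vocabulary_with_specials(toy_data)
print(encode_with_padding(["cast", "lightning", "!"], vocab, max_length=5))
# [2, 1, 4, 0, 0]
```

Here "lightning" is unseen, so it maps to the `<UNK>` index 1, and the two trailing zeros are unambiguous padding.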

In [33]:
def encode_intent(intent, intent_vocabulary):
    return intent_vocabulary[intent] 

In [34]:
# Encoding vocabularies 
encoded_data = []
for tokens, intent in tokenized_data:
    encoded_sequence = encode_sequence(tokens, vocabulary)
    encoded_intent = encode_intent(intent, intent_vocabulary) 
    encoded_data.append((encoded_sequence, encoded_intent))

In [35]:
print(encoded_data[:5])  # Print the first 5 elements 

[([0, 1, 2, 3, 4, 5, 6], 4), ([7, 8], 5), ([9, 10, 1, 11, 6], 1), ([12, 13, 1, 14], 0), ([15, 16, 17, 1, 18, 19], 3)]


# SPLITTING THE DATA

In [36]:
from sklearn.model_selection import train_test_split

# Assuming encoded_data is your list of (encoded_command, encoded_intent) tuples
training_data, validation_and_test_data = train_test_split(encoded_data, test_size=0.3, random_state=42)

# Further split the held-out 30%: test_size=0.33 keeps about 67% of it for validation and 33% for test (roughly 20% / 10% of all data)
validation_data, test_data = train_test_split(validation_and_test_data, test_size=0.33, random_state=42) 
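
The intent labels in this notebook are heavily imbalanced (most utterances fall under `filler`), so a plain random split can leave very few examples of rare intents in a given subset. One option is the `stratify` argument of `train_test_split`, sketched here on toy data; the toy pairs and variable names are assumptions for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy (encoded_command, encoded_intent) pairs: 90 of class 0, 10 of class 1.
toy_data = [([i], 0) for i in range(90)] + [([i], 1) for i in range(10)]
toy_labels = [label for _, label in toy_data]

# stratify keeps the 90/10 class ratio in both parts of the split.
train, held_out = train_test_split(
    toy_data, test_size=0.3, random_state=42, stratify=toy_labels
)

print(Counter(label for _, label in train))
print(Counter(label for _, label in held_out))
```

Stratification requires at least two examples of every class in the data being split.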


In [37]:
#NOTE: Extracting Labels Later (If Needed)
training_commands = [data[0] for data in training_data]
training_labels = [data[1] for data in training_data]

In [38]:
print("Training Data Length:", len(training_data))
print("Validation Data Length:", len(validation_data))
print("Test Data Length:", len(test_data))

Training Data Length: 220721
Validation Data Length: 63379
Test Data Length: 31217


# CRAFT A PyTorch DATASET to streamline RNN's data handling.
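
A minimal sketch of such a dataset, assuming the `(encoded_sequence, encoded_intent)` pair format produced earlier. The names `IntentDataset` and `collate_batch` are hypothetical, and padding with 0 assumes that index 0 is reserved for padding, which is not yet the case in the vocabulary built above.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class IntentDataset(Dataset):
    """Wraps a list of (encoded_sequence, encoded_intent) pairs."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        sequence, intent = self.pairs[idx]
        return torch.tensor(sequence, dtype=torch.long), torch.tensor(intent, dtype=torch.long)

def collate_batch(batch, pad_value=0):
    """Pad the variable-length sequences in a batch to the longest one."""
    sequences, intents = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=pad_value)
    return padded, torch.stack(intents)

# Toy pairs in the same shape as `encoded_data`.
toy_pairs = [([0, 1, 2, 3], 4), ([7, 8], 5), ([9, 10, 1], 1)]
loader = DataLoader(IntentDataset(toy_pairs), batch_size=3, collate_fn=collate_batch)
batch_sequences, batch_intents = next(iter(loader))
print(batch_sequences.shape, batch_intents.tolist())  # torch.Size([3, 4]) [4, 5, 1]
```

Padding per batch in the collate function avoids padding every command to one global maximum length.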

# CREATING THE MODEL

In [None]:
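from sklearn.naive_bayes import MultinomialNB

# NOTE: X_train / y_train (and X_test / y_test below) are not defined in the cells above.
# They presumably come from a train/test split of the bag-of-words matrix `X` and the
# `intents` labels built in the commented-out CountVectorizer cell.
clf = MultinomialNB().fit(X_train, y_train) 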

In [None]:
from sklearn.metrics import accuracy_score 

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy) 

Accuracy: 0.9739946720791577


In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[    3     0    58     0     0    29     0]
 [    0     0    23     0     0    29     1]
 [   11     6 51787     0     3   440   143]
 [    0     0     8     0     0    28     0]
 [    0     0     8     0     0    32     0]
 [    2     0   190     0     0  9372     8]
 [    0     0    44     0     0   577   262]]




              precision    recall  f1-score   support

      attack       0.19      0.03      0.06        90
        cast       0.00      0.00      0.00        53
      filler       0.99      0.99      0.99     52390
        hide       0.00      0.00      0.00        36
        jump       0.00      0.00      0.00        40
        move       0.89      0.98      0.93      9572
        talk       0.63      0.30      0.40       883

    accuracy                           0.97     63064
   macro avg       0.39      0.33      0.34     63064
weighted avg       0.97      0.97      0.97     63064
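
The 97% accuracy above is dominated by the `filler` and `move` classes. As a worked check, the sketch below recomputes accuracy and per-class recall directly from the confusion matrix printed above (rows are true classes, columns are predictions, in the alphabetical label order of the report). Only the variable names are new.

```python
labels = ["attack", "cast", "filler", "hide", "jump", "move", "talk"]
# Rows are true classes, columns are predicted classes (scikit-learn's convention).
matrix = [
    [3, 0, 58, 0, 0, 29, 0],
    [0, 0, 23, 0, 0, 29, 1],
    [11, 6, 51787, 0, 3, 440, 143],
    [0, 0, 8, 0, 0, 28, 0],
    [0, 0, 8, 0, 0, 32, 0],
    [2, 0, 190, 0, 0, 9372, 8],
    [0, 0, 44, 0, 0, 577, 262],
]

total = sum(sum(row) for row in matrix)                   # all test examples
correct = sum(matrix[i][i] for i in range(len(labels)))   # diagonal = correct predictions
recalls = [matrix[i][i] / sum(matrix[i]) for i in range(len(labels))]
macro_recall = sum(recalls) / len(recalls)

print(f"accuracy: {correct / total:.4f}")
print(f"macro-averaged recall: {macro_recall:.4f}")
for label, recall in zip(labels, recalls):
    print(f"{label:>7}: recall {recall:.2f}")
```

The macro average weights every intent equally, which is why it sits near 0.33 while accuracy is 0.97: `cast`, `hide`, and `jump` are never predicted correctly.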



