# Overview
***
This notebook gives a brief overview of the data and what I plan to accomplish with it.
***
## Data Processing Overview
For more exact details, please see [initial_base_data_exploration.ipynb](https://github.com/Data-Science-for-Linguists-2022/Sociolinguistics-In-Video-Games/blob/main/notebooks/initial_base_data_exploration.ipynb) and [`HDialogueParser.py`](https://github.com/Data-Science-for-Linguists-2022/Sociolinguistics-In-Video-Games/blob/main/scripts/HDialogueParser.py). Regardless, my processing can be outlined in the following steps:
1. Fork van Stegeren's [original repository](https://github.com/hmi-utwente/video-game-text-corpora).
2. Extract the three datasets from there.
3. Clean them up, extracting and reformatting the datasets as needed.
4. Save them to `.pkl` files, creating sample `.pkl` files as well.
5. Use `HDialogueParser.py` to extract the dialogue from _Hollow Knight_.
6. Clean up its DataFrame likewise.
7. Save to `.pkl` likewise.
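
The save steps above (4 and 7) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy DataFrame, the temporary output directory, the file names, and the sample size and seed are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for a cleaned dataset; the real columns differ per game.
df = pd.DataFrame({
    "speaker": ["A", "B", "C", "D"],
    "text": ["Hello there.", "Bring me the relic.", "Stay safe.", "Who are you?"],
})

# Hypothetical output directory (the notebook itself reads from a private folder).
out_dir = Path(tempfile.mkdtemp())

# Save the full dataset.
df.to_pickle(out_dir / "toy.pkl")

# Save a small, reproducible sample (size and seed are arbitrary choices).
sample_df = df.sample(n=2, random_state=0)
sample_df.to_pickle(out_dir / "toy_sample.pkl")

# Round-trip check: the restored frame should equal the original.
restored = pd.read_pickle(out_dir / "toy.pkl")
print(restored.equals(df))
```

Fixing `random_state` keeps the sample file stable between runs, which is convenient if the sample is committed while the full data stays private.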
***
## Research Questions
After looking more deeply at the data, I have formed more well-defined questions that I want to try to answer. Some questions pertain to a specific dataset, while others will (hopefully) be answered using some combination of the datasets. They are as follows:
1. How are orders/requests realized in video game dialogues? Are there more direct or indirect orders?
2. What is the frequency of the 2nd person pronoun, _you_? 
    * Extending that, what are the frequencies of other pronouns?
3. What are some common named entities in video games?
    * Extending this, what are some hapaxes related to named entities?
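
As a rough idea of how question 2 might be approached, here is a minimal sketch of counting pronouns in a `text` column. The mini-corpus and the pronoun list are made up for illustration, and the regex tokenizer is a simplification that will not match the `nltk.word_tokenize` counts used elsewhere in this notebook.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical mini-corpus standing in for a `text` column.
toy_df = pd.DataFrame({"text": [
    "You seem weary. Can you help me?",
    "I need your help, but they are guarding it.",
]})

# Small, non-exhaustive pronoun list chosen for illustration only.
PRONOUNS = {"i", "me", "you", "he", "she", "it", "we", "they", "them", "us"}


def pronoun_counts(texts):
    """Count pronoun tokens using a simple regex tokenizer (not nltk)."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tok for tok in tokens if tok in PRONOUNS)
    return counts


counts = pronoun_counts(toy_df["text"])
print(counts.most_common())
```

Possessives such as _your_ are deliberately left out of the toy list; whether to include them is an open design choice for the real analysis.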
***
## Contents
1. [The Elder Scrolls Overview](#Elder-Scrolls-Series-Overview)
2. [Torchlight II Overview](#Torchlight-II-Overview)
3. [Star Wars: Knights of the Old Republic Overview](#Star-Wars:-Knights-of-the-Old-Republic-Overview)
4. [Hollow Knight Overview](#Hollow-Knight-Overview)
5. [Additional Notes](#Additional-Notes)
***
## Python Imports
Below are the imports needed to display the saved `.pkl` files. All `.pkl` files are the full datasets, located in a private subdirectory that is listed in `.gitignore`.

In [1]:
# Necessary imports
import pandas as pd

# I looked this up so everything is prettier
from IPython.display import display, HTML

CSS = """
.output {
    flex-direction: row;
}
"""

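# Wrap in display(): a bare HTML(...) expression is only rendered when it is the last line of a cell
display(HTML('<style>{}</style>'.format(CSS)))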

# Use display(dataframe) to show

# FILEPATH
FILEPATH = "../private/pickled_dfs/"

***
## Elder Scrolls Series Overview
### Game Summary
_The Elder Scrolls_ is a series of single-player role-playing games where the player takes control of the main character (usually some prophesied hero) to explore a region of the medieval-style fantasy world _Tamriel_. The series was created by _Bethesda Game Studios_; its first entry, _The Elder Scrolls: Arena_, was published in 1994, and the dataset's most recent entry, _The Elder Scrolls: Online_, was released in 2014. Note, however, that the latest entry is not single-player, but a massively multiplayer online game (MMO).
### Data Summary
The dataset itself contains the books which players can find scattered throughout the games' maps. The data was collected from the [Elder Scrolls Fan Wikis](https://elderscrolls.fandom.com/wiki/The_Elder_Scrolls_Wiki). Each entry contains the following information:
* `title`: The book's title.
* `author`: The book's (in-game) author (can be anonymous/unknown).
* `description` (optional): A brief description of the book's contents.
* `game`: Which games the book appears in.
* `text`: The book's actual text.
* `word_count`: The book's word count (according to `nltk.word_tokenize`).
* `url`: The URL of the webpage parsed for the data.
### DataFrame Summary

In [2]:
# Load the DataFrame from the .pkl file
es_df = pd.read_pickle(FILEPATH + 'elder_scrolls.pkl')
display(es_df.head())
print(es_df.info())

Unnamed: 0,title,author,description,game,text,word_count,url
0,dying-mans-last-words,Indie,The last words of a world-renowned archaeolog...,[Morrowind],It's been many days since the collapse. I have...,394,https://www.imperial-library.info/content/dyin...
1,fair-warning,Cumanya,Fair warning to those who would enter the Grea...,[Morrowind],This being an account of my limited journeys i...,228,https://www.imperial-library.info/content/fair...
2,game-dinner,Anonymous,A published letter from an anonymous spy about...,"[Morrowind, Oblivion, Skyrim]",A GAME AT DINNER\nby\nAn Anonymous Spy\n\nForw...,2705,https://www.imperial-library.info/content/game...
3,hypothetical-treachery,Anthil Morvir,An amusing play about a bunch of backstabbing ...,"[Morrowind, Oblivion, Skyrim]",A Hypothetical Treachery\nA One Act Play\nby\n...,2008,https://www.imperial-library.info/content/hypo...
4,leaflet,Anonymous,Propaganda against the alchemist Aurane Frernis.,[Morrowind],BEWARE!!!!\nHAVE NO DEALINGS with AURANE FRERN...,117,https://www.imperial-library.info/content/leaflet


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5446 entries, 0 to 5445
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        5446 non-null   object
 1   author       5446 non-null   object
 2   description  5446 non-null   object
 3   game         5446 non-null   object
 4   text         5446 non-null   object
 5   word_count   5446 non-null   int64 
 6   url          5446 non-null   object
dtypes: int64(1), object(6)
memory usage: 298.0+ KB
None


In [10]:
# Let's get some other stats
print("===================== OTHER STATS =====================")
print("UNIQUE AUTHORS:", len(es_df['author'].unique()))
print("UNIQUE ENTRIES:", len(es_df['title'].unique()))
print("AVG WORD COUNT:", es_df['word_count'].mean())
print("TOTAL WORD COUNT:", es_df['word_count'].sum())

master_list = list()
for glist in es_df.loc[:, 'game']:
    for game in glist:
        if game not in master_list:
            master_list.append(game)
print("GAMES FEATURED:", ', '.join(master_list))

UNIQUE AUTHORS: 2106
UNIQUE ENTRIES: 5446
AVG WORD COUNT: 351.9120455380095
TOTAL WORD COUNT: 1916513
GAMES FEATURED: Morrowind, Oblivion, Skyrim, TES: Online, Daggerfall, Arena
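
Because `game` holds a list per book, `DataFrame.explode` is one way to get per-game counts as well as the list of featured games. The sketch below runs on a three-row toy frame (titles borrowed from the preview above, everything else made up), so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy frame mirroring the list-valued `game` column shown above.
toy_es = pd.DataFrame({
    "title": ["leaflet", "game-dinner", "fair-warning"],
    "game": [["Morrowind"], ["Morrowind", "Oblivion", "Skyrim"], ["Morrowind"]],
})

# explode() yields one row per (book, game) pair.
per_game = toy_es.explode("game")

# Number of books appearing in each game.
books_per_game = per_game["game"].value_counts()
print(books_per_game)

# Unique games in order of first appearance, like the loop in the cell above.
games = per_game["game"].unique().tolist()
print(", ".join(games))
```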


***
## Torchlight II Overview
### Game Summary
_Torchlight II_ is another single-player role-playing game where the player completes quests while exploring randomly generated dungeons. A dungeon is a series of rooms and corridors where the player is pitted against enemies, all in the hope of finding items and completing quests. The game was made by _Runic Games_ and released in 2012.
### Data Summary
The dataset was originally split into a `.csv` file and a `.json` file. However, I only extracted information from the `.csv` file, as it contained the actual dialogue and metadata needed for my research. The `.json` file contained mostly game-engine-related data. Each entry contains the following information:
* `speaker`: The person/entity giving the quest. "NO SPEAKER" indicates this quest is given through some narration or other format.
* `text`: The text of what the speaker says.
* `word_count`: The text's word count (according to `nltk.word_tokenize`).
* `dialogtype`: How the text is said.
* `quest_displayname`: The in-game name of the quest. "NON-QUEST" means this dialogue is not part of a standard quest or is given during a quest (associated through `quest_name`).
* `quest_name`: The in-engine/in-file name of the quest. "NON-QUEST" entries are hooked up to quests here.
### DataFrame Summary 

In [4]:
# Load the DataFrame from the .pkl file
tl_df = pd.read_pickle(FILEPATH + 'torchlight.pkl')
display(tl_df.head())
print(tl_df.info())

Unnamed: 0,speaker,text,word_count,dialogtype,quest_displayname,quest_name
0,Felicia,You seem weary - perhaps you are in need of a ...,47,complete,The Adventure Continues,global_newgameplus
1,Bounty Board,BOUNTY!\nThe Sturmbeorn raiders are a threat t...,64,intro,Bounty: Sturmbeorn!,a1z1-bountyboard2
2,Bounty Board,The Sturmbeorn have been conducting raids in n...,64,return,Bounty: Sturmbeorn!,a1z1-bountyboard2
3,Bounty Board,The Empire thanks you for your valuable servic...,15,complete,Bounty: Sturmbeorn!,a1z1-bountyboard2
4,NO SPEAKER,"Kill 10 Sturmbeorn in the Temple Steppes, then...",22,details,Bounty: Sturmbeorn!,a1z1-bountyboard2


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1008 entries, 0 to 1007
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   speaker            1008 non-null   object
 1   text               1008 non-null   object
 2   word_count         1008 non-null   int64 
 3   dialogtype         1008 non-null   object
 4   quest_displayname  1008 non-null   object
 5   quest_name         1008 non-null   object
dtypes: int64(1), object(5)
memory usage: 47.4+ KB
None


In [11]:
print("===================== OTHER STATS =====================")
print("QUESTS FROM NO SPEAKER:", len(tl_df[tl_df['speaker'] == 'NO SPEAKER']))
print("NUMBER OF UNIQUE SPEAKERS:", len(tl_df['speaker'].unique()))
print("NUMBER OF IN-FILE QUESTS:", len(tl_df['quest_name'].unique()))
print("NUMBER OF IN-GAME QUESTS:", len(tl_df['quest_displayname'].unique()))
print("DIALOGUE TYPES: ", len(tl_df['dialogtype'].unique()))
print("AVG WORD COUNT:", tl_df['word_count'].mean())
print("TOTAL WORD COUNT:", tl_df['word_count'].sum())

QUESTS FROM NO SPEAKER: 348
NUMBER OF UNIQUE SPEAKERS: 84
NUMBER OF IN-FILE QUESTS: 131
NUMBER OF IN-GAME QUESTS: 90
DIALOGUE TYPES:  9
AVG WORD COUNT: 33.9156746031746
TOTAL WORD COUNT: 34187


***
## Star Wars: Knights of the Old Republic Overview
### Game Summary
_Star Wars: Knights of the Old Republic_ is a 2003 single-player role-playing game developed by _BioWare_ and published by _LucasArts_. The player plays as Revan and traverses the _Star Wars_ universe during the age of the Old Republic (set before the prequel trilogy). This game is commonly abbreviated as _KOTOR_/_Kotor_, and as such, further mentions of this game in this and other notebooks/markdown files will follow suit.
### Data Summary
This dataset consisted of two `.csv` files: one containing the actual dialogue, and the other containing metadata about the animations the characters perform during the dialogue. For context, game developers usually employ a set number of animations which all/most character models can do. This allows them to mass-apply certain animations throughout the game and save development time and costs. Each entry in `kotor.pkl` contains the following:
* `speaker`: The speaker of the dialogue.
* `listener`: Who is being directly addressed by the speaker. This serves as important sociolinguistic information.
* `text`: What the speaker says in text format.
* `word_count`: The text's word count (according to `nltk.word_tokenize`).
* `animation`: A Python list of animations which play during the dialogue.
* `next`: A Python list of the next chunks of dialogue which occur.
* `previous`: A Python list of the previous chunks of dialogue which occurred in that cutscene/instance.
* `comment`: Comment(s) left by the developers in the in-game text files.

Each entry for the metadata file `meta_kotor.pkl` contains the following:
* `name`: Animation name.
* `looping`: Whether this animation loops.
* `fireforget`: I am waiting for clarification from the creator of the original repository, but I believe this indicates whether or not this animation cancels the NPC being in the attack state.
* `dialog`: Whether this is a dialogue animation (i.e., whether it can play during dialogue).
* `cu_pb_range`: Another one I am unsure of. I believe it relates either to the duration of the animation (in seconds) or to the time it takes to transition into the animation from another animation (in seconds).
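
One way the two files might be combined is to explode the `animation` lists and join them against the metadata on `name`. The sketch below assumes the entries in `animation` are animation names that match `name` in the metadata, which I have not verified; the speakers, texts, and `looping` values are toy data.

```python
import pandas as pd

# Toy dialogue frame; real rows have more columns.
toy_kotor = pd.DataFrame({
    "speaker": ["Speaker A", "Speaker B"],
    "text": ["We should go.", "Be careful."],
    "animation": [["Talk_Normal", "Salute"], []],
})

# Toy metadata frame; the looping values are invented for the example.
toy_meta = pd.DataFrame({
    "name": ["Talk_Normal", "Salute"],
    "looping": [1, 0],
})

# One row per (line of dialogue, animation); empty lists become NaN.
exploded = toy_kotor.explode("animation")

# Left join keeps lines with no animation (their metadata stays NaN).
merged = exploded.merge(toy_meta, how="left", left_on="animation", right_on="name")
print(merged[["speaker", "animation", "looping"]])
```

A left join is used so that dialogue without any animation is not silently dropped.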
### DataFrames Summary

In [6]:
# Load the DataFrame from the .pkl file
kotor_df = pd.read_pickle(FILEPATH + 'kotor.pkl')
display(kotor_df.head())
print(kotor_df.info())

Unnamed: 0,speaker,listener,text,word_count,animation,next,previous,comment
0,Anchorhead Tradesman,NO LISTENER,Take care of yourself. The price of kolto tank...,16,[],[],[None],NO COMMENT
1,Anchorhead Tradesman,NO LISTENER,The Selkath put a bunch of export restrictions...,18,[],[],[None],NO COMMENT
2,Anchorhead Tradesman,NO LISTENER,I hear that Manaan is no longer shipping kolto...,20,[],[],[None],NO COMMENT
3,Anchorhead Tradesman,NO LISTENER,"If you have kolto tanks, use them sparingly. I...",18,[],[],[None],NO COMMENT
4,Anchorhead Tradesman,NO LISTENER,I'm sure I saw some holo-footage of you on the...,20,[],[],[None],NO COMMENT


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29213 entries, 0 to 29212
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   speaker     29213 non-null  object
 1   listener    29213 non-null  object
 2   text        29213 non-null  object
 3   word_count  29213 non-null  int64 
 4   animation   29213 non-null  object
 5   next        29213 non-null  object
 6   previous    29213 non-null  object
 7   comment     29213 non-null  object
dtypes: int64(1), object(7)
memory usage: 1.8+ MB
None


In [7]:
# Load the DataFrame from the .pkl file
meta_df = pd.read_pickle(FILEPATH + 'meta_kotor.pkl')
display(meta_df.head())
print(meta_df.info())

Unnamed: 0,name,looping,fireforget,dialog,overlay,cu_pb_range
0,Dead,1,0,1,0,0.0
1,Taunt,0,1,1,0,0.8
2,Greeting,0,1,1,0,0.8
3,Listen,1,0,1,1,0.0
4,Worship,0,1,1,0,2.6


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         31 non-null     object 
 1   looping      31 non-null     int64  
 2   fireforget   31 non-null     int64  
 3   dialog       31 non-null     int64  
 4   overlay      31 non-null     int64  
 5   cu_pb_range  31 non-null     float64
dtypes: float64(1), int64(4), object(1)
memory usage: 1.6+ KB
None


In [13]:
print("===================== OTHER STATS =====================")
print("NUMBER OF UNIQUE SPEAKERS:", len(kotor_df['speaker'].unique()))
print("NUMBER OF UNIQUE LISTENERS:", len(kotor_df['listener'].unique()))
print("NUMBER OF TIMES PLAYER IS INTENDED LISTENER:", len(kotor_df[kotor_df['listener'] == 'PLAYER']))
print("LIST OF ANIMATIONS:", ', '.join(list(meta_df['name'])))
print("AVG WORD COUNT:", kotor_df['word_count'].mean())
print("TOTAL WORD COUNT:", kotor_df['word_count'].sum())

NUMBER OF UNIQUE SPEAKERS: 538
NUMBER OF UNIQUE LISTENERS: 152
NUMBER OF TIMES PLAYER IS INTENDED LISTENER: 1331
LIST OF ANIMATIONS: Dead, Taunt, Greeting, Listen, Worship, Salute, Bow, Talk_Normal, Talk_Pleading, Talk_Forceful, Talk_Laughing, Talk_Sad, Victory, Scratch_Head, Drunk, Inject, Flirt, Use_Computer_LP, Horror, Use_Computer, Persuade, Activate, Sleep, Prone, Ready, Pause, Choked, Talk_Injured, Listen_Injured, Kneel_Talk_Angry, Kneel_Talk_Sad
AVG WORD COUNT: 19.075788176496765
TOTAL WORD COUNT: 557261


***
## Hollow Knight Overview
### Game Summary
_Hollow Knight_ is a [Metroidvania](https://en.wikipedia.org/wiki/Metroidvania#:~:text=Metroidvania%20is%20a%20subgenre%20of,genre%20borrowing%20from%20both%20series.)-style game made by _Team Cherry_ and released in early 2017. The player controls _The Knight_ as they make their way through the dark fantasy world. What's interesting about this game is that it was made by 3 people, with the writing being done by 2 of those individuals in a Google Doc. So, I believe this will be an amazing dataset for exploring the differences between professional video game writing and amateur writing (although the writing is still good).
### Data Summary
I gathered the data from a fan-made [Google Document](https://docs.google.com/document/d/1oaED7I6xL5NItD-wKyDB455f58d3weLz8OMIkRyEQlo/edit#heading=h.wgd1af4mikjx) containing every single textual item in the game. I don't know why its authors made it, especially since I know it was compiled completely by hand. The document is over 200 pages long and extremely well-formatted. I used [`HDialogueParser.py`](https://github.com/Data-Science-for-Linguists-2022/Sociolinguistics-In-Video-Games/blob/main/scripts/HDialogueParser.py) with the following steps:
1. Copy the original Google Document, simplifying the non-textual and header elements. Get both an `.html` and `.txt` version.
2. Use `beautifulsoup` to parse the `.html` file, getting the characters and main descriptions from each section.
3. Use the `.txt` file to create a tagged-style annotated `.txt` file which was then used to create the DataFrame.
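
To give a feel for step 2, here is a minimal sketch of pulling section headings out of an HTML export. It is not what `HDialogueParser.py` does: it swaps `beautifulsoup` for the standard library's `html.parser` so it runs without extra installs, and it assumes (hypothetically) that character names sit in `<h2>` elements. The HTML string is a made-up fragment.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h2> element (assumed to hold character names)."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        # Only accumulate text while inside an <h2> element.
        if self.in_heading:
            self.headings[-1] += data


# Made-up fragment standing in for the exported document.
html = "<h2>Iselda</h2><p>First encounter</p><h2>Leg Eater</h2><p>Fragile Heart</p>"

collector = HeadingCollector()
collector.feed(html)
print(collector.headings)
```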

Each entry contains the following:
* `character`: The character's name.
* `text`: A text dump of all the character's dialogue, split into chunks via description tags. However, there can be further sub-descriptions in each part. I am... **extremely** familiar with this dataset, so I am able to tell what is dialogue and what is a sub-description (although, in general, it's pretty obvious). Therefore, I recommend that only people familiar with the game use this dataset. You may also refer to the original Google Document to see what is and isn't a sub-description.
* `word_count`: The text's word count (according to `nltk.word_tokenize`).
### DataFrame Summary

In [9]:
# Load the DataFrame from the .pkl file
hollow_df = pd.read_pickle(FILEPATH + 'hollow_knight.pkl')
display(hollow_df.head())
print(hollow_df.info())

Unnamed: 0,character,text,word_count
0,Confessor Jiji,"\n\n\n\n\nFirst encounter\nWelcome, small intr...",1035
1,Iselda,"\n\n\n\n\n\n\nA map of the Ancient Basin, a la...",1643
2,Leg Eater,\n\n\nFragile Heart\nThis is a precious thing....,819
3,Little Fool,\n\n\n\n\nFirst encounter\nAha! Another warrio...,593
4,Millibelle the Banker/Millibelle the Thief,\n\n\n\n\n\n\nHello there dearie. I was about ...,631


<class 'pandas.core.frame.DataFrame'>
Int64Index: 55 entries, 0 to 54
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   character   55 non-null     object
 1   text        55 non-null     object
 2   word_count  55 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 1.7+ KB
None


In [15]:
print("===================== OTHER STATS =====================")
print("AVG WORD COUNT:", hollow_df['word_count'].mean())
print("TOTAL WORD COUNT:", hollow_df['word_count'].sum())

AVG WORD COUNT: 756.2545454545455
TOTAL WORD COUNT: 41594


***
## Additional Notes
All datasets will be used to answer my research questions. However, here are some highlights regarding each question, along with some hypotheses/guesses about the datasets.
1. The _Hollow Knight_, _Torchlight II_, and _Kotor_ datasets will be perfect for this question (also _The Dishonored Series_ if I can get it). _The Elder Scrolls_ data contains actual books published in-game, so I don't expect to get much, if anything, out of it for this question. I hypothesize that the majority of in-game discourse involves the player in some way, shape, or form. I also hypothesize that **all** dialogue is tailored to help or immerse the player.
2. All datasets will be useful here. However, I suspect much dialogue writing can be thought of as a template:
    I need {x item}, but {y boss/creature} is guarding it. Can you help me?
Or something similar. 
3. I suspect general world-building text/dialogue is more free form and less restricted to its genre. I also suspect there to be sociolinguistic trends when certain characters speak/write.
4. I already know this one is true, but it will be nice to capture it with real data. In particular, with _The Elder Scrolls_ series, the games' "races" all derive heavily from real-world cultures. There is also a very obvious set of linguistic traits associated with one of these races that are explained with pseudo-linguistics. These traits show up in the books written by that race. I am going to have a lot of fun with this one.
5. I hope I can make meaningful comparisons between the _Hollow Knight_ dataset and all the other datasets in some way. I hypothesize that the dialogue written in _Hollow Knight_ is much more directed towards the player, with very little non-player-related dialogue, due to the game's genre and team size. I also suspect the quality of the writing differs, but I am unsure how to meaningfully compare it when all groups involved are native English speakers.
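
The template idea in note 2 could be tested with something as simple as a regular expression. The pattern below is a loose, hypothetical first pass that only covers the exact "I need X, but Y is guarding it" wording; real dialogue would need a much broader set of patterns. Both example lines are invented.

```python
import re

# Loose pattern for "I need X, but Y is/are guarding it/them" style requests.
TEMPLATE = re.compile(
    r"\bI need (?P<item>[^,.!?]+), but (?P<obstacle>[^,.!?]+?) "
    r"(?:is|are) guarding (?:it|them)\b",
    flags=re.IGNORECASE,
)

# Invented example lines, not taken from any of the datasets.
lines = [
    "I need the amulet, but a troll is guarding it. Can you help me?",
    "Welcome to the inn, traveler.",
]

# Collect (item, obstacle) pairs for every line that fits the template.
hits = []
for line in lines:
    match = TEMPLATE.search(line)
    if match:
        hits.append((match.group("item"), match.group("obstacle")))

print(hits)
```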

I think these datasets and games will be perfect for the questions I am trying to answer. I expect my end-game data to be mostly examples of certain discourse/pragmatic features drawn from the DataFrames, plus charts highlighting trends in each DataFrame (word counts relating to quest type, trends in hapax frequency, etc.). This project is definitely going to be analysis-heavy, but I don't expect there to be too many graphs, as it will focus more on case studies and comparisons within the datasets.
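
As an example of the kind of trend table I have in mind, word counts could be grouped by the _Torchlight II_ `dialogtype` column. The sketch below uses a four-row toy frame with made-up values, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy frame with the two relevant Torchlight II columns (values are made up).
toy_tl = pd.DataFrame({
    "dialogtype": ["intro", "return", "complete", "intro"],
    "word_count": [64, 64, 15, 40],
})

# Number of lines, mean length, and total words per dialogue type.
summary = (
    toy_tl.groupby("dialogtype")["word_count"]
    .agg(["count", "mean", "sum"])
    .sort_values("sum", ascending=False)
)
print(summary)
```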