# 03-01 : Identify Game Version

In this notebook we attempt to identify the four game versions that are present in the training data. The approach is based on the work done by [steubk](https://www.kaggle.com/code/steubk/meetings-are-boring-the-notebook).

The game versions are:

- 0 dry: no humor or snark
- 1 nohumor: no humor (includes snark)
- 2 nosnark: no snark (includes humor). No snark can also be thought of as "obedient"
- 3 normal: base script (includes snark and humor)
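
The list above can be written down as a small lookup table. The sketch below is only an illustration: the names `GameVersion`, `GAME_VERSIONS` and `version_from_flags` are hypothetical and are not used elsewhere in this notebook. With this numbering, the codes 0 to 3 happen to coincide with a two-bit encoding of the humor and snark flags.

```python
from typing import Dict, NamedTuple


class GameVersion(NamedTuple):
    """One game version and the script features it includes."""
    code: int
    humor: bool
    snark: bool


# Flags transcribed from the list of game versions above.
GAME_VERSIONS: Dict[str, GameVersion] = {
    "dry": GameVersion(0, humor=False, snark=False),
    "nohumor": GameVersion(1, humor=False, snark=True),
    "nosnark": GameVersion(2, humor=True, snark=False),
    "normal": GameVersion(3, humor=True, snark=True),
}


def version_from_flags(humor: bool, snark: bool) -> str:
    """Return the version name that matches the given feature flags."""
    for name, version in GAME_VERSIONS.items():
        if version.humor == humor and version.snark == snark:
            return name
    raise ValueError("no game version matches the given flags")


# The numeric code equals 2 * humor + snark for every version.
codes_match = all(
    v.code == 2 * int(v.humor) + int(v.snark) for v in GAME_VERSIONS.values()
)
print(version_from_flags(humor=True, snark=False), codes_match)
```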


In [1]:
import sys
import logging
from typing import Iterable, List, Tuple, Dict

import pandas as pd
import numpy as np

from IPython.display import display
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

### Configure Logging

In [2]:
logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S',
    handlers=[logging.StreamHandler(sys.stdout)],
)

logging.info("Started")

2023-03-21 16:13:50 INFO     Started


## Data Collection

In [3]:
dtypes = {
    "session_id": np.int64,
    "elapsed_time": np.int32,
    "event_name": "category",
    "name": "category",
    "level": np.uint8,
    "page": "category",
    "room_coor_x": np.float32,
    "room_coor_y": np.float32,
    "screen_coor_x": np.float32,
    "screen_coor_y": np.float32,
    "hover_duration": np.float32,
    "text": "category",
    "fqid": "category",
    "room_fqid": "category",
    "text_fqid": "category",
    "fullscreen": "category",
    "hq": "category",
    "music": "category",
    "level_group": "category",
}

In [4]:
# load the source training set
df_source = pd.read_csv('data/competition/train.csv.gz', 
                        compression='gzip',
                        dtype=dtypes)

print(df_source.shape)
with pd.option_context('display.max_columns', None):
    display(df_source.head(3))

(26296946, 20)


Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,1,1323,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,2,831,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4


In [5]:
# load the source training labels
df_source_labels = pd.read_csv('data/competition/train_labels.csv')

print(df_source_labels.shape)
with pd.option_context('display.max_columns', None):
    display(df_source_labels.head(3))

(424116, 2)


Unnamed: 0,session_id,correct
0,20090312431273200_q1,1
1,20090312433251036_q1,0
2,20090312455206810_q1,1


In [6]:
# load the game versions data
df_game_versions = pd.read_csv('data/game_versions/game_type_text_translation.csv')

print(df_game_versions.shape)
with pd.option_context('display.max_columns', None):
    display(df_game_versions.head(3))

(76, 4)


Unnamed: 0,normal,dry,nohumor,nosnark
0,Gah. I can't believe this.,Gah. I can't believe this.,I can't believe this.,Gah. I can't believe this.
1,Then who took Teddy???,Then who took Teddy?,Then who took Teddy?,Then who took Teddy???
2,"Um, are you okay?","Um, are you okay?","Um, are you okay?",Are you okay?
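
Each row of the translation table appears to hold one line of dialogue in its four wordings. As a rough illustration of how the versions differ, the sketch below rebuilds the three rows shown above in a toy frame and lists, for one version, the wordings that no other version shares in the same row. The names `toy` and `rows_unique_to` are hypothetical, and the row-wise comparison is a simplification: the analysis later in this notebook compares against all text of the other versions, not just the same row.

```python
import pandas as pd

# The three rows displayed above, copied by hand into a toy frame.
toy = pd.DataFrame({
    "normal": ["Gah. I can't believe this.", "Then who took Teddy???", "Um, are you okay?"],
    "dry": ["Gah. I can't believe this.", "Then who took Teddy?", "Um, are you okay?"],
    "nohumor": ["I can't believe this.", "Then who took Teddy?", "Um, are you okay?"],
    "nosnark": ["Gah. I can't believe this.", "Then who took Teddy???", "Are you okay?"],
})


def rows_unique_to(table: pd.DataFrame, version: str) -> list:
    """Return the wordings of `version` that differ from every other column in the same row."""
    others = table.drop(columns=[version])
    # compare the chosen column against each remaining column, aligned on the row index
    differs_from_all = others.ne(table[version], axis=0).all(axis=1)
    return table.loc[differs_from_all, version].tolist()


for name in toy.columns:
    print(name, rows_unique_to(toy, name))
```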


## Extend the Game Version Data

In [7]:
def extend_game_version_data(
        X: pd.DataFrame,
        versions_data: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    """Attach the level from the source data to the text of each game version.

    Returns one DataFrame per game version with the columns `text` and
    `level`, restricted to text that actually occurs in `X`.
    """
    version_datasets = {}

    # get a unique list of text values and their level
    df_text_levels = X[['text', 'level']].drop_duplicates()

    # process each game version separately
    for game_version in tqdm(versions_data.columns):
        print(f"Processing game version: {game_version}")

        # merge the level from the source data with the game version data;
        # the left merge leaves NaN where the text never occurs in the source
        # data (which also turns `level` into a float column), and dropna()
        # removes those rows again
        version_datasets[game_version] = versions_data[[game_version]] \
            .merge(
                df_text_levels,
                how='left',
                left_on=game_version,
                right_on='text') \
            .drop(columns=[game_version]) \
            .drop_duplicates(subset='text', keep='first') \
            .dropna() \
            .reset_index(drop=True)

    return version_datasets

# test the function
version = 'nosnark'
extended_versions = extend_game_version_data(df_source, df_game_versions)
print(extended_versions[version].shape)

with pd.option_context('display.max_columns', None):
    display(extended_versions[version].sort_values('level').head(50))


Processing game version: normal
Processing game version: dry
Processing game version: nohumor
Processing game version: nosnark
(58, 2)


Unnamed: 0,text,level
36,I get to go to Gramps's meeting!,0.0
55,"Sure thing, Jo. Grab your notebook and come up...",0.0
51,"See you later, Teddy.",0.0
47,"Can I come, Gramps?",0.0
32,Yes! This cool old slip from 1916.,2.0
49,Hot Dog! I knew it!,2.0
10,"Ooh, I like clues!",2.0
26,Why don't you go catch up with your grampa?,2.0
40,Hopefully you can rustle up some clues!,3.0
39,"What are you still doing here, Jolie?",5.0
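
The `level` column is displayed as `0.0`, `2.0`, and so on, although it was loaded as `np.uint8`. A likely explanation is the left merge: rows without a match receive `NaN`, which forces the column to a float dtype. The toy example below sketches this effect and one way to restore the integer dtype after the unmatched rows are dropped. The frames `versions` and `text_levels` and the level values in them are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for one game version column and the text/level lookup.
versions = pd.DataFrame(
    {"nosnark": ["Can I come, Gramps?", "Are you okay?", "Not in the logs"]}
)
text_levels = pd.DataFrame({
    "text": ["Can I come, Gramps?", "Are you okay?"],
    "level": np.array([0, 1], dtype=np.uint8),  # made-up levels
})

# The unmatched third row gets NaN, so `level` is upcast to a float dtype.
merged = versions.merge(text_levels, how="left", left_on="nosnark", right_on="text")
print(merged["level"].dtype)

# Dropping the unmatched rows and casting back restores the compact dtype.
cleaned = (
    merged.drop(columns=["nosnark"])
    .dropna()
    .reset_index(drop=True)
    .astype({"level": np.uint8})
)
print(cleaned.dtypes["level"], len(cleaned))

# For this toy data an inner merge gives the same rows without the upcast.
inner = versions.merge(text_levels, how="inner", left_on="nosnark", right_on="text")
print(inner["level"].dtype)
```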


In [8]:
# for each version in extended_versions, add one column per other version that
# indicates whether the text also occurs in that version
for version in extended_versions:
    for other_version in extended_versions:
        if version != other_version:
            extended_versions[version][f"{other_version}_exists"] = \
                extended_versions[version]['text'] \
                .isin(extended_versions[other_version]['text'])

    # add a single column to indicate if the text occurs in any of the other versions;
    # computed once per version, after all per-version columns exist
    exists_columns = [
        f"{other_version}_exists"
        for other_version in extended_versions
        if other_version != version
    ]
    extended_versions[version]['any_other_version_exists'] = \
        extended_versions[version][exists_columns].any(axis=1)

# inspect the result: print the text that is unique to each version
for version in extended_versions:
    unique_text = extended_versions[version] \
        .query('not any_other_version_exists')['text'] \
        .to_numpy()

    print(f"'{version}': ", end='')
    print(unique_text)
    print()


'normal': ["Actually, we're just here for some photos." "Fine, fine. Let's see..."
 'And we still need to figure out that flag!'
 'I need to find Wells right away!! Do you know where he is?'
 'Meetings are BORING!']

'dry': ['Sure!' 'Yes! This old slip from 1916.']

'nohumor': ['Do I have to?'
 'I need to find Wells right away! Do you know where he is?'
 'Ugh. Meetings are so boring.']

'nosnark': ['Are you okay?' "Don't worry, Teddy won't eat your lunch anymore!"
 'Can I ride with you?'
 "He's always trying to get you in trouble, and he doesn't like animals!"
 "We're just looking for photos for the flag display."
 "Don't worry, he won't! (And he's a badger, by the way.)"
 'Yes! This cool old slip from 1916.']
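
One way the version-specific lines could be used is to label each session by the signature text it contains. The sketch below is not part of the original analysis: `SIGNATURE_TEXT` holds a hand-copied subset of the output above, `label_sessions` is a hypothetical helper, and the session ids in `events` are made up. Sessions without any signature line are simply absent from the result.

```python
import pandas as pd

# A subset of the version-specific lines printed above, copied by hand.
SIGNATURE_TEXT = {
    "normal": ["Meetings are BORING!"],
    "dry": ["Sure!", "Yes! This old slip from 1916."],
    "nohumor": ["Ugh. Meetings are so boring.", "Do I have to?"],
    "nosnark": ["Are you okay?", "Can I ride with you?"],
}


def label_sessions(events: pd.DataFrame) -> pd.Series:
    """Label each session by the version whose signature text it contains."""
    # invert the mapping so that each signature line points to its version
    text_to_version = {
        text: version
        for version, texts in SIGNATURE_TEXT.items()
        for text in texts
    }
    hits = events.assign(game_version=events["text"].map(text_to_version))
    hits = hits.dropna(subset=["game_version"])

    grouped = hits.groupby("session_id")["game_version"]
    # a session that matches more than one version is flagged as ambiguous
    return grouped.first().where(grouped.nunique() == 1, "ambiguous")


# Made-up events: session 1 is "normal", session 2 mixes two versions,
# session 3 contains no signature line at all.
events = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3],
    "text": [
        "Meetings are BORING!",
        "Just talking to Teddy.",
        "Sure!",
        "Are you okay?",
        "Just talking to Teddy.",
    ],
})
labels = label_sessions(events)
print(labels)
```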

