# Processing a single Story 

In this exposition, my main is to preprocess a single story in a way that we can send it to our model. Considering the complexity of the storium data, this is an extensive task. Let us see what it involves. Firstly, we have defined some useful datastructures that we would need for preprocessing in **datastructures.py**. Let me elaborate upon the nature of these. We have three of them: 

1. **Class Trim**
2. **Class IndexedSet** 
3. **Class IndexedDic** 



### Imports & Datafiles 

In [3]:
import argparse
import bisect
import glob
import heapq
import json
import logging
import os
from enum import Enum, auto, unique
from numbers import Number
from typing import (Any, Dict, Iterable, List, Mapping, Optional, Sequence,
                    Tuple, TypeVar, Union)

import torch
import zlib
from contextlib import contextmanager
from dataclasses import dataclass
from itertools import islice, zip_longest
from typing import (Any, Dict, Iterable, List, Optional, Sequence, Tuple,
                    TypeVar, Union)

from kiwisolver import Constraint, Solver, Variable, strength
from transformers import AutoTokenizer, PreTrainedTokenizer


<details>    
<summary>
    <font size="5" color="purple"><b> Understanding Trim, IndexedDict and IndexedSet functionalities </b></font>
</summary>
<p>
<ul>
- The **trim class** defines how a data sequence should be trimmed it is exceeds a certain length. This is useful for us because we are going to be sending lengths which would far exceed the maximum capacity of the tokenizer (more than 1024 words for GPT-2. Hence, the trim class will be used to trim the segments which become too long 
    
- The **IndexedSet** basically allows only unique elements to remain inside. However, it is distinct from the normal set class in that it allows for efficient indexing and insertion operations (so it maintains the order of the elements unlike the normal set dataclass)
    
- While a standard dictionary in Python is accessed via keys, **IndexedDict** allows acces via integer indices as well. 

</ul>
</p>

 

In [5]:
from datastructures import (
    IndexedDict,
    IndexedSet,
    Trim,
)

# Segments 

As we talked about in our previous exposition, the main uniqueness of the dataset are segments itself. These bring different cards together and effectively preprocesses them so that we can send them inside our model. Now, in data exploration, we talked about constraints and how these can be implemented using Cassoway Algorithm. Constraints are essential because they offer a way to control the length and contents of the segments. Since many of the models that we would be finetuning with allow a fixed context window, these constraints will allow us to flexibiliy change the context window. We don't need to worry about the creation of segments as the process itself is streamlined in the github repository associated with the dataset. A high-level overview would be fine for the project we are considering (which builds on the original work as we talked about).




</ul>
</p>

<details>    
<summary>
    <font size="5" color="purple"><b>Overview of Segments </b></font>
</summary>
<p>
<ul>
 What we need to define are the following: 
    
    
- **Segment_ids** : these will associate a unique ID with each segment.
- **EOS separator** : this will represent special tokens like a separator or end-of-sequence token.
- **trim, constrained**, **naive_max_length**, **preferred_length**, **length:** will control how the segment is trimmed and constrained in length.

As for constraint management, methods like hard_constraints, medium_constraints, and strong_constraints define various levels of constraints that can be applied to the segment's length, providing a flexible mechanism for controlling segment size and composition.


</ul>
</p>

In [6]:
from segments import (
    Segment,
)

# Story Flow and Preprocessing

These are essential parts of the code that we need to understand if we want to implement and build on the dataset as our project requires. Therefore, I have decided to bring many of the functionalities out of the general preprocessing class and see how each of the story is being preprocessed at every step of the way. On my way, I have modified some of the functions so that their logic is clear to everyone in the group.  My main aim with this is to see step-by-step how each story is being processed. This will ensure that we are familiar with the dataset to such extent that we would not have any difficulty when we actually implement the project.  

Firstly, let me talk about the class **SpecialToken**. This one is more important than others conceptually. We know that during finetuning, we are going to sending across various cards such as **character cards**, **scenes*, **establishment** entries. The question arises as to how will our model be able to meaningfully seperate between these? (afterall at the end your just feeding numbers to the model. There has to be some correlation in those numbers that your model needs to pick on for it to meaningfully learn). 

Well, we can use special tokens to do this. Special Tokens identify parts of narrative like characters and actions. In particular, we can sandwitch cards between special tokens unique to them. For example, if we define a token like character, we will enclose it with the value *<|CHARACTER|>* 
... 

*The question of how effective this strategy is should be explored. That is, how best to demarcate our data so the model can best pick up on various correlations*



In [7]:
@unique
class SpecialToken(str, Enum):
    """
    An enumeration of special tokens
    """

    # This method must be defined before calling "auto()"
    # pylint:disable=unused-argument,no-self-argument
    def _generate_next_value_(name, start, count, last_values):
        """
        Automatically create the enumeration name
        """
        return f"<|{name.upper()}|>"

    # pylint:enable=unused-argument,no-self-argument

    @classmethod
    def from_string(cls, name: str):
        """
        Get the associated SpecialToken from the passed in string
        """
        if name == "name":
            # Handle this specially since Enum defines "name" as a string, but
            # we want to use it to extract the field from the data
            name = "name_field"

        return cls(f"<|{name.upper()}|>")

    missing = auto()  # denotes missing information
    separator = auto()  # separator token at the beginning of each Segment
    character = auto()  # a character's biography

    # All the possible card types from <CardNamespace> in the Storium export
    # format: https://storium.com/help/export/json/0.9.2
    chartype = auto()
    goal = auto()
    person = auto()
    place = auto()
    thing = auto()
    strength = auto()
    weakness = auto()
    obstacle = auto()
    subplot = auto()

    # generic attributes for cards, characters, etc
    name_field = auto()  # cannot call it "name", as Enum defines it as well
    description = auto()

    # Contextual card attributes
    failure_stakes = auto()
    success_stakes = auto()

    # Information denoting entry type
    move = auto()
    establishment = auto()
    addition = auto()
    conclusion = auto()

    # some notions of ordering
    previous = auto()  # can stack, e.g. previous + previous => timestep t-2

    # some notions of authorship
    narrator = auto()  # can stack,  e.g. previous + narrator
    same_character = auto()  # can stack,  e.g. previous + same_character
    diff_character = auto()  #  can stack,  e.g. previous + diff_character

    def __str__(self):
        """
        Override the default string method to return the enumeration value,
        which is a string
        """
        return self.value


# Encode & Encode Special function 

The **encode function** is pretty standard. You can think of it as performing normal tokenization as must be carried for all finetuning tasks. The authors have also added a functionality that suppress warnings related to text length exceeding the maximum length allowed by the tokenizer. However, all in all, its a pretty normal function. Let me briefly mention the **extract_string** function in this context. It is called again and again when we are tokenizing text. What it does is that it retrieves a string value from a dictionary based on a specified field name. If the field exists in the dictionary and its corresponding value is not None, the function returns that value as a string. 


**The Encode Special** is of significant interest. Basically, the *encode function extends the encode function by adding functionality to handle special tokens and create a Segment.* Hence, after tokenization has occured, the process of **segmentation** as discussed in the original paper is carried out by encode_special. 




In [8]:
def encode(
    string_or_list: Union[str, List[str]],
    tokenizer,
    max_length: int

) -> List[int]:
    """
    PreTrainedTokenizer.encode outputs warnings if the text being tokenized
    is longer than the max_length specified in the tokenizer.
    Unfortunately, the order of operations is to warn first, then truncate
    to the max length that was passed in, resulting in spurious warnings,
    so we wrap the function to suppress these warning messages.
    """
    logger = logging.getLogger(PreTrainedTokenizer.__module__)
    log_level = logger.getEffectiveLevel()
    logger.setLevel(logging.ERROR)

    # Check if string_or_list is a list and join it if necessary
    text_to_encode = " ".join(string_or_list) if isinstance(string_or_list, list) else string_or_list
    
    # ================================ Unprint the line to see the text we are encoding ==================== # 
    
    #print(text_to_encode)
    
    # ======================================================================================================
    if max_length is not None:
        tokens = tokenizer.encode(
            text_to_encode , max_length=1024  
        )
    else:
        tokens = tokenizer.encode(text_to_encode)
        
    logger.setLevel(log_level)
    return tokens

In [9]:
def encode_special(
    string_or_list: Union[str, List[str]],
    tokenizer,
    special_token: Optional[SpecialToken] = None,
    separator_token_id: Optional[int] = None,
    eos_token_id: Optional[int] = None,
    preferred_length: int = 0,
    trim: Trim = Trim.end,
) -> Segment:
        """
        After encoding with the tokenizer, this creates, create a Segment and
        assign the special_token if specified.
        """
        return Segment(
            encode(string_or_list,tokenizer,1024), # WE NOW PASS TOKENIZER AND MAX LENGTH 
                                                    # EXPLICITLY
            separator=tokenizer.convert_tokens_to_ids(SpecialToken.separator)
            if separator_token_id is None
            else separator_token_id,
            eos=eos_token_id,
            segment_ids=[tokenizer.convert_tokens_to_ids(special_token)]
            if special_token
            else tuple(),
            preferred_length=preferred_length,
            trim=trim,
        )

In [13]:
def extract_string(
    field: str, mapping: Dict[str, Any], default: str = SpecialToken.missing.value
) -> str:
    """
    Extract the given string field, accounting for the potential that it is
    specified as None
    """
    # ============================================================= #
    #print("FLOW OF COMPUTATION 19 ")
    # ============================================================= #

    return mapping.get(field, default) or default

# Dataclasses 

The code has dataclasses. Basically when our entire data would be processed, it would be in the form of these dataclasses. 

1. **ProcessedStory** - Our entire story would be in the form of this dataclass. It contains three things: 
    - *characters*
    - *entries*
    - *establishment_entries* 
- These would serve as our main cards. The entries and establishment entries would be in the form of **Entry Info** and character entries would be in the form of **Character Info**. Notice that they are in the form of IndexedDict.

2. *Entry Info* -  Container for entries and establishment entries. 
2. *Character Info* - Container for characters.  

In [14]:
@dataclass
class CharacterInfo:
    """
    The processed character info
    """
    
    # ============================================================= #
    #print("Inside CharacterInfo")
    # ============================================================= #

    summary: Segment
    character_id: str
    checksum: int

    # This is a sorted list of entry ids written by the character to
    # allow easily looking up the previous entries for the character
    entry_ids: IndexedSet

In [15]:
@dataclass
class EntryInfo:
    """
    The processed entry info
    """


    entry_id: str
    character_id: str
    establishment_id: str
    checksum: int
    text: Segment
    summary: Segment


In [16]:
@dataclass
class ProcessedStory:
    """
    This defines the structure of a story after processing
    """


    
    game_id: str

    # A mapping of character id to character info
    characters: IndexedDict[CharacterInfo]

    # A mapping of entry id to entry info
    entries: IndexedDict[EntryInfo]

    # A mapping of entry id to establishment's entry info
    establishment_entries: IndexedDict[EntryInfo]


# Checksums 

For each card, there are checksums. This allows for selective reprocessing of data. Only the data that has changed (as indicated by a changed checksum) needs to be reprocessed and re-encoded for training. Hence, with entry associated card, we have a checksum. 

In [31]:
def checksum_card(card: Optional[Dict[str, Any]], checksum: int = 1) -> int:
    """
    Checksum the card.
    """
    if not card:
        return checksum

    for field in ("name", "description", "success_stakes", "failure_stakes"):
        checksum = zlib.adler32(
            extract_string(field, card).encode("utf-8"), checksum
        )

    return checksum

In [32]:
def checksum_cards(cards: List[Dict[str, Any]], checksum: int = 1) -> int:
    """
    Create the summary of a card
    """
    for card in cards:
        checksum = checksum_card(card, checksum)

    return checksum


In [33]:
def checksum_character(character: Dict[str, Any], character_id: str) -> int:
    """
    Compute a checksum of a character
    """
    checksum = zlib.adler32(character_id.encode("utf-8"))
    for field in ("name", "description"):
        checksum = zlib.adler32(
            extract_string(field, character).encode("utf-8"), checksum
        )

    return checksum

In [34]:
def checksum_entry(entry: Dict[str, Any], entry_id: str) -> int:
    """
    Compute a checksum of an entry
    """
    checksum = zlib.adler32(entry_id.encode("utf-8"))
    entry_type = entry["format"]
    if entry_type == "move":
        checksum = checksum_card(entry.get("target_challenge_card"), checksum)
        checksum = checksum_cards(
            entry.get("cards_played_on_challenge", []), checksum
        )
    elif entry_type == "establishment":
        checksum = checksum_card(entry.get("place_card"), checksum)
    elif entry_type == "addition":
        checksum = checksum_cards(entry.get("challenge_cards", []), checksum)

    return zlib.adler32(
        extract_string("description", entry, "").encode("utf-8"), checksum
    )

# Summaries of Cards

These are the actual functions that preprocesses each entry and tokenizes them. 

In [35]:
def summarize_character(character: Dict[str, Any], tokenizer) -> Segment:
    """
    Create the summary for a character
    """
    name_encoded = encode_special(
        extract_string("name", character),
        tokenizer,
        SpecialToken.from_string("name"),
        separator_token_id=tokenizer.bos_token_id,
    )
    
    description_encoded = encode_special(
        extract_string("description", character),
        tokenizer,
        SpecialToken.from_string("description"),
    )
    
    encoded_fields = [name_encoded, description_encoded]
    
    return Segment(
        iter(encoded_fields),
        segment_ids=[tokenizer.convert_tokens_to_ids(SpecialToken.character)],
    )

In [36]:
def summarize_card(tokenizer, card: Optional[Dict[str, Any]]) -> Segment:
    """
    Create the summary of a card.

    If it's a challenge card, then it'll have "success_stakes" and
    "failure_stakes" as well.
    """
    if not card:
        return Segment()

    return Segment(
        iter(
            encode_special(
                string_or_list=extract_string(field, card),
                tokenizer=tokenizer,
                special_token=SpecialToken.from_string(field),
            )
            for field in ("name", "description", "success_stakes", "failure_stakes")
            if card.get(field)
        ),
        segment_ids=tuple(
            tokenizer.convert_tokens_to_ids(
                (SpecialToken.from_string(card["namespace"]),)
            ),
        ),
    )

In [37]:
def summarize_cards(tokenizer,cards: List[Dict[str, Any]]) -> Segment:
    """
    Create the summary of a card
    """
    return Segment(iter(summarize_card(tokenizer,card) for card in cards))

In [38]:
def summarize_entry(tokenizer,entry: Dict[str, Any]) -> Segment:
    """
    Create the summary of an entry
    """
    summary = []
    entry_type = entry["format"]
    if entry_type == "move":
        challenge = summarize_card(tokenizer,entry.get("target_challenge_card"))
        if challenge:
            summary.append(challenge)

        cards = summarize_cards(tokenizer,entry.get("cards_played_on_challenge", []))
        if cards:
            summary.append(cards)
    elif entry_type == "establishment":
        place = summarize_card(tokenizer,entry.get("place_card"))
        if place:
            summary.append(place)
    elif entry_type == "addition":
        cards = summarize_cards(tokenizer, entry.get("challenge_cards", []))
        if cards:
            summary.append(cards)

    return Segment(
        summary,
        segment_ids=tuple(
            tokenizer.convert_tokens_to_ids(
                (SpecialToken.from_string(entry_type),)
            ),
        ),
    )

# The story details 

We now consider the main story processing. This is done by **process_story** function. The workflow of this function is the following:  
1. **Extract Scenes and Characters:** It starts by extracting scenes and characters from the story dictionary. If these are not present or not in the correct format, the function returns the processed object if it exists, effectively skipping processing.
2. **Initialize Character List:**  A list of characters is initialized, starting with a default narrator character entry, which is always present in Storium stories but without a detailed summary (it has a checksum of 0, an empty entry_ids set, and an empty Segment as summary).

3. **Process Characters:** Iterate over each character in the characters list. Generate a character_id from the character_seq_id and prefix it with character:.

4. **Process Scenes and Entries:** : Iterate over each scene in scenes, and within each, iterate over its entries. For each entry, compute its checksum and determine if it needs processing based on whether it has changed from the previously processed version. Process the entry using process_entry, which tokenizes and structures the entry's text, and associates it with the relevant character and scene information.

5. **Construct ProcessedStory Object:** Compile the processed data into a ProcessedStory object, containing the structured data for the entire story, including mappings of characters and entries.

The function utilizes another function called **process_entry**. The process_entry function is designed to process a single entry in a narrative or dataset, such as a character's action or a segment of a story, and encapsulate the processed data into an EntryInfo object


In [39]:
# for now I set the preffered length as a hyperparameter manually. 
# It is set to be 256

def process_entry(
    tokenizer,
    entry: Dict[str, Any],
    establishment_id: str,
    checksum: int,
    add_eos: bool = True,
    force: bool = False,
) -> Optional[EntryInfo]:
    """
    Process a character entry
    """
    ###  SETTING HYPERPARAMETERS ON MY OWN ############
    preferred_entry_length = 256 
    
    ############################################
    
    text = extract_string("description", entry, "")
    if not text and not force and entry.get("format") != "establishment":
        # Only modeling moves with written text, though make a special
        # exception for establishment entries. While they are currently
        # required to have text, it seems at some point there were games that
        # didn't have any text for the establishment entry, though it would still
        # have place cards.
        return None

    
    encoded_text = encode_special(
        text,
        tokenizer=tokenizer,
        special_token=SpecialToken.from_string(entry["format"]),
        preferred_length=preferred_entry_length,
        eos_token_id=tokenizer.eos_token_id if add_eos else None,
    )
    summary = summarize_entry(tokenizer,entry)
    if not summary:
        summary = encode_special(
            string_or_list=text,
            tokenizer=tokenizer,
            special_token=SpecialToken.from_string(entry["format"]),
            trim=Trim.start,  # Treat the end of the entry text as a summary
        )

    return EntryInfo(
        checksum=checksum,
        entry_id=entry["seq_id"],
        character_id=entry["role"],
        establishment_id=establishment_id,
        text=encoded_text,
        summary=summary,
    )

# Process_story

In [40]:
def process_story(story: Dict[str, Any], tokenizer, processed: Optional[ProcessedStory] = None) -> Optional[ProcessedStory]:


    # HERE WE OBTAIN THE SCENES 
    scenes = story.get("scenes")
    print("the total number of scenes are ", len(scenes))
    
    # HERE WE OBTAIN ALL THE CHARACTERS 
    characters = story.get("characters")
    print("the total number of characters are ", len(scenes))

    # If either scenes or characters are missing, or scenes is not a proper sequence,
    # we return previously processed data if available
    if not scenes or not characters or not isinstance(scenes, Sequence):
        return processed
    
    #We now create the character_list. To do this, we first sort the entry_ids using indexedSet(). The character_id
    # is set to the narrative. To obtain the actual contents which is found in summary, we send it to the Segment ()
    # class. 

    character_list = [
        # Treat narrator as a character who is always present without a summary
        (
            "narrator",
            CharacterInfo(
                checksum=0,
                entry_ids=IndexedSet(),
                character_id="narrator",
                summary=Segment(),
            ),
        )
    ]
    
    # =============================================================================== # 
    #                           Processing Character Entries 
    # ================================================================================#
    
    # We now Process each character in the story. We obtain the following:
    # - Their ID, their associated checksum, their summary which is tokenized
    # - Finally, we encapsulate all of it in the dataclass CharacterInfo. 
    for character in characters:
        character_id = character.get("character_seq_id")
        if not character_id:
            continue


        character_id = f"character:{character_id}"

        character_info = (
            processed.characters.get(character_id, None) if processed else None
        )
        

        # Compute the checksum for the character
        checksum = checksum_character(character, character_id)
        if not character_info or character_info.checksum != checksum:
            # Haven't processed this character before, so process it now
            character_info = CharacterInfo(
                checksum=checksum,
                entry_ids=IndexedSet(),
                character_id=character_id,
                summary=summarize_character(character,tokenizer),
            )

        character_list.append(
            (
                character_id,
                character_info,
            )
        )
        


    all_characters = IndexedDict(character_list)
    
    # =============================================================================== # 
    #                           Processing Scene Entries 
    # ================================================================================#    
    
    # same as characters. Obtain id, checksum and tokenized summaries and then encpasulate
    # in dataclass entry_info. 
    
    entry_list: List[Tuple[str, EntryInfo]] = []
    establishment_list: List[Tuple[str, EntryInfo]] = []
    for scene in scenes:
        entries = scene.get("entries", [])
        if not entries or not isinstance(entries, Sequence):
            continue

        for entry in entries:
            entry_id = entry.get("seq_id", None)
            if entry_id is None:
            
                continue
                

            
            checksum = checksum_entry(entry, entry_id)
            

            entry_info = (
                processed.entries.get(entry_id, None) if processed else None
            )
            if not entry_info or entry_info.checksum != checksum:
                # Haven't processed this entry before, so process it now
                entry_info = process_entry(
                    tokenizer,
                    entry,
                    establishment_list[-1][0] if establishment_list else entry_id,
                    checksum,
                )
            if not entry_info:
                continue

            entry_list.append((entry_id, entry_info))
            entry_format = entry.get("format")
            if entry_format == "establishment":
                establishment_list.append((entry_id, entry_info))

            character_info = (
                all_characters[  # pylint:disable=unsubscriptable-object
                    entry["role"]
                ]
            )

            character_info.entry_ids.insert(entry_id)


    return ProcessedStory(
        game_id=story["game_pid"],
        entries=IndexedDict(entry_list),
        characters=all_characters,
        establishment_entries=IndexedDict(establishment_list),
    )

# Preprocessing a single story 

*To be done. We will do a before vs after of preprocessed story. For now I am just showing how to feed a story inside the function. Hopefully, we will be able to finetune soon. Apologies for the delay but work has been alot and the dataset has really taken alot of time to understand. 




In [41]:


dataset_path = r'C:\Users\AWCD\OneDrive\Desktop\CS438 Generative AI\Project\storium_2019_08_22'    
file_path = '/full_export/5/9/591cca.json'
total_path = dataset_path + file_path

with open(total_path, 'r',encoding='utf-8') as file:
    story_data = json.load(file) 


In [42]:
from transformers import AutoTokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(type(tokenizer))


processed_story = process_story(story_data, tokenizer) # as talked about 
                                              # we would now need to explicitly 
                                              # pass tokenizer


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<class 'transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer'>
the total number of scenes are  54
the total number of characters are  54
