# Processing a single Story 

In this exposition, my main aim is to preprocess a single story so that it can be sent to our model. Given the complexity of the Storium data, this is an extensive task, so let us see what it involves. First, the data structures needed for preprocessing are defined in **datastructures.py**. There are three of them:

1. **Class Trim**
2. **Class IndexedSet**
3. **Class IndexedDict**



### Imports & Datafiles 

In [1]:
import argparse
import bisect
import glob
import heapq
import json
import logging
import os
from enum import Enum, auto, unique
from numbers import Number
from typing import (Any, Dict, Iterable, List, Mapping, Optional, Sequence,
                    Tuple, TypeVar, Union)

import torch
import zlib
from contextlib import contextmanager
from dataclasses import dataclass
from itertools import islice, zip_longest

from kiwisolver import Constraint, Solver, Variable, strength
from transformers import AutoTokenizer, PreTrainedTokenizer


<details>    
<summary>
    <font size="5" color="purple"><b> Understanding Trim, IndexedDict and IndexedSet functionalities </b></font>
</summary>
<p>
<ul>
- The **Trim class** defines how a data sequence should be trimmed if it exceeds a certain length. This is useful because we will be sending sequences that far exceed the maximum context of the model (1024 tokens for GPT-2). Hence, Trim is used to cut down the segments that become too long.

- The **IndexedSet** keeps only unique elements. Unlike the built-in `set`, it maintains the order of its elements, which allows efficient indexing and insertion.

- While a standard Python dictionary is accessed via keys, **IndexedDict** allows access via integer indices as well.

</ul>
</p>

 

In [2]:
from datastructures import (
    IndexedDict,
    IndexedSet,
    Trim,
)
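
To make the three structures concrete, here is a minimal sketch of the behaviour described above. It is not the code in **datastructures.py**: the names `TrimSketch`, `trim_tokens`, `SortedIndexedSet`, `IndexedDictSketch`, `key_at` and `value_at` are hypothetical and exist only for illustration. The set is assumed to keep its elements sorted, as the comment on `entry_ids` in `CharacterInfo` suggests.

```python
import bisect
from enum import Enum, auto
from typing import Any, Iterable, List, Sequence, Tuple


class TrimSketch(Enum):
    """Hypothetical stand-in for Trim: which side of a sequence to cut."""
    start = auto()  # drop tokens from the beginning, keep the end
    end = auto()    # drop tokens from the end, keep the beginning


def trim_tokens(tokens: Sequence[int], max_length: int, trim: TrimSketch) -> List[int]:
    """Cut tokens down to max_length from the side given by trim."""
    if len(tokens) <= max_length:
        return list(tokens)
    if trim is TrimSketch.start:
        return list(tokens[len(tokens) - max_length:])
    return list(tokens[:max_length])


class SortedIndexedSet:
    """Unique elements kept in sorted order, so positions are meaningful."""

    def __init__(self, items: Iterable[Any] = ()) -> None:
        self._items: List[Any] = []
        for item in items:
            self.insert(item)

    def insert(self, value: Any) -> None:
        position = bisect.bisect_left(self._items, value)
        if position == len(self._items) or self._items[position] != value:
            self._items.insert(position, value)  # duplicates are ignored

    def index(self, value: Any) -> int:
        position = bisect.bisect_left(self._items, value)
        if position < len(self._items) and self._items[position] == value:
            return position
        raise ValueError(f"{value!r} is not in the set")

    def __getitem__(self, position: int) -> Any:
        return self._items[position]

    def __len__(self) -> int:
        return len(self._items)


class IndexedDictSketch(dict):
    """A dict that also exposes its entries by insertion position."""

    def __init__(self, pairs: Iterable[Tuple[Any, Any]] = ()) -> None:
        super().__init__(pairs)
        self._keys = list(self.keys())

    def key_at(self, position: int) -> Any:
        return self._keys[position]

    def value_at(self, position: int) -> Any:
        return self[self._keys[position]]


entry_ids = SortedIndexedSet([30, 10, 20, 10])
characters = IndexedDictSketch([("narrator", "N"), ("character:1", "C1")])
print([entry_ids[i] for i in range(len(entry_ids))])  # [10, 20, 30]
print(characters.key_at(1), characters["narrator"])   # character:1 N
```

The sorted order is what makes it cheap to ask for "the entries a character wrote before entry X": find the position of X and slice to the left of it.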

# Segments 

As we discussed in the previous exposition, what makes this dataset unique is its segments. A segment brings different cards together and preprocesses them so that they can be fed to our model. In the data exploration, we talked about constraints and how they can be implemented using the Cassowary algorithm. Constraints are essential because they offer a way to control the length and contents of the segments. Since many of the models we will fine-tune have a fixed context window, these constraints let us flexibly fit the segments into that window. We do not need to worry about creating segments ourselves, as the process is already streamlined in the GitHub repository associated with the dataset. A high-level overview is sufficient for our project, which builds on the original work.




</ul>
</p>

<details>    
<summary>
    <font size="5" color="purple"><b>Overview of Segments </b></font>
</summary>
<p>
<ul>
 What we need to define are the following:


- **segment_ids**: these associate an identifying ID (the ID of a special token) with each segment.
- **eos, separator**: these hold special tokens such as an end-of-sequence or separator token.
- **trim**, **constrained**, **naive_max_length**, **preferred_length**, **length**: these control how the segment is trimmed and constrained in length.

As for constraint management, methods like `hard_constraints`, `medium_constraints`, and `strong_constraints` define constraints of different strengths on the segment's length, providing a flexible mechanism for controlling segment size and composition.


</ul>
</p>

In [3]:
from segments import (
    Segment,
)
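
To give an intuition for what the length constraints achieve, here is a deliberately simplified sketch. It does not use the Cassowary solver and is not how `Segment` is implemented. It is a greedy stand-in written purely for illustration, and the function name `allocate_lengths` is hypothetical. Every segment first receives up to its preferred length, and whatever remains of the context window is then handed out, in order, to segments that are still truncated.

```python
from typing import List


def allocate_lengths(lengths: List[int], preferred: List[int], budget: int) -> List[int]:
    """Greedy illustration (not Cassowary): split a token budget across segments."""
    allocation: List[int] = []
    remaining = budget

    # Pass 1: give each segment up to its preferred length, never exceeding the budget
    for length, wanted in zip(lengths, preferred):
        granted = min(length, wanted, remaining)
        allocation.append(granted)
        remaining -= granted

    # Pass 2: hand leftover budget to segments that were cut below their full length
    for i, length in enumerate(lengths):
        extra = min(length - allocation[i], remaining)
        allocation[i] += extra
        remaining -= extra

    return allocation


# Three segments of 300, 900 and 50 tokens sharing a 1024-token window
print(allocate_lengths([300, 900, 50], [256, 256, 256], 1024))  # [300, 674, 50]
```

A real constraint solver weighs all the preferences at once instead of serving segments in order, which is why the repository relies on one rather than on a loop like this.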

# Story Flow and Preprocessing

These are the parts of the code that we need to understand if we want to build on the dataset as our project requires. I have therefore pulled many of the functions out of the general preprocessing class to see how a story is preprocessed at every step. Along the way, I have modified some of the functions so that their logic is clear to everyone in the group. Working through the pipeline step by step should make us familiar enough with the dataset that the actual implementation of the project poses no difficulty.

First, let me talk about the class **SpecialToken**, which is conceptually more important than the others. During fine-tuning, we will be sending various cards to the model, such as **character cards**, **scenes**, and **establishment** entries. The question is how the model can meaningfully separate these. (After all, in the end we are just feeding numbers to the model, and there has to be some regularity in those numbers for the model to pick up on.)

Well, we can use special tokens to do this. Special tokens identify parts of the narrative such as characters and actions. In particular, we can sandwich each card between special tokens unique to its type. For example, a character card is marked with the token *<|CHARACTER|>*
...

*The question of how effective this strategy is should be explored. That is, how best to demarcate our data so the model can best pick up on various correlations*



In [162]:
@unique
class SpecialToken(str, Enum):
    """
    An enumeration of special tokens
    """

    # This method must be defined before calling "auto()"
    # pylint:disable=unused-argument,no-self-argument
    def _generate_next_value_(name, start, count, last_values):
        """
        Automatically create the enumeration name
        """
        return f"<|{name.upper()}|>"

    # pylint:enable=unused-argument,no-self-argument

    @classmethod
    def from_string(cls, name: str):
        """
        Get the associated SpecialToken from the passed in string
        """
        if name == "name":
            # Handle this specially since Enum defines "name" as a string, but
            # we want to use it to extract the field from the data
            name = "name_field"
        print(name)

        return cls(f"<|{name.upper()}|>")

    missing = auto()  # denotes missing information
    separator = auto()  # separator token at the beginning of each Segment
    character = auto()  # a character's biography

    # All the possible card types from <CardNamespace> in the Storium export
    # format: https://storium.com/help/export/json/0.9.2
    chartype = auto()
    goal = auto()
    person = auto()
    place = auto()
    thing = auto()
    strength = auto()
    weakness = auto()
    obstacle = auto()
    subplot = auto()

    # generic attributes for cards, characters, etc
    name_field = auto()  # cannot call it "name", as Enum defines it as well
    description = auto()

    # Contextual card attributes
    failure_stakes = auto()
    success_stakes = auto()

    # Information denoting entry type
    move = auto()
    establishment = auto()
    addition = auto()
    conclusion = auto()

    # some notions of ordering
    previous = auto()  # can stack, e.g. previous + previous => timestep t-2

    # some notions of authorship
    narrator = auto()  # can stack,  e.g. previous + narrator
    same_character = auto()  # can stack,  e.g. previous + same_character
    diff_character = auto()  #  can stack,  e.g. previous + diff_character

    def __str__(self):
        """
        Override the default string method to return the enumeration value,
        which is a string
        """
        return self.value
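
The two pieces of machinery in this class, `_generate_next_value_` and the lookup by value used in `from_string`, can be tried out in isolation. The sketch below uses a cut-down enumeration, `MiniToken`, and a helper, `tag`, both of which are hypothetical names introduced only for this example. Prefixing the text with the token is one possible way of marking it, not necessarily how the segments are assembled in the repository.

```python
from enum import Enum, auto, unique


@unique
class MiniToken(str, Enum):
    """A cut-down, illustrative version of SpecialToken."""

    # Must be defined before the first call to auto()
    def _generate_next_value_(name, start, count, last_values):
        return f"<|{name.upper()}|>"

    character = auto()
    name_field = auto()
    description = auto()

    def __str__(self) -> str:
        return self.value


def tag(token: MiniToken, text: str) -> str:
    """Mark a piece of text with the special token for its type."""
    return f"{token.value} {text}"


print(MiniToken.character.value)                     # <|CHARACTER|>
print(MiniToken("<|NAME_FIELD|>").name)              # name_field
print(tag(MiniToken.description, "A wary sailor"))   # <|DESCRIPTION|> A wary sailor
```

Looking a member up by its value, as in the second `print`, is exactly what `from_string` does after it has rebuilt the `<|...|>` string from a field name.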


# Encode & Encode Special function 

The **encode function** is fairly standard: it performs the normal tokenization that every fine-tuning task requires. The authors have also added logic to suppress warnings about text exceeding the maximum length allowed by the tokenizer. The **extract_string** function deserves a brief mention here, since it is called repeatedly while we tokenize text. It retrieves a string value from a dictionary by field name. If the field exists and its value is neither None nor empty, the function returns that value; otherwise it returns a default, which is the `<|MISSING|>` token unless another default is given.


**encode_special** is of greater interest. It *extends the encode function by handling special tokens and creating a Segment.* Hence, after tokenization has occurred, the **segmentation** discussed in the original paper is carried out by encode_special.




In [163]:
def encode(
    string_or_list: Union[str, List[str]],
    tokenizer,
    max_length: Optional[int],
) -> List[int]:
    """
    PreTrainedTokenizer.encode outputs warnings if the text being tokenized
    is longer than the max_length specified in the tokenizer.
    Unfortunately, the order of operations is to warn first, then truncate
    to the max length that was passed in, resulting in spurious warnings,
    so we wrap the function to suppress these warning messages.
    """
    logger = logging.getLogger(PreTrainedTokenizer.__module__)
    log_level = logger.getEffectiveLevel()
    logger.setLevel(logging.ERROR)

    # Join a list of strings into a single string before tokenizing
    text_to_encode = " ".join(string_or_list) if isinstance(string_or_list, list) else string_or_list

    # ================ Uncomment the line below to see the text we are encoding ================ #
    #print(text_to_encode)

    try:
        if max_length is not None:
            # Respect the max_length that was passed in and truncate explicitly
            tokens = tokenizer.encode(
                text_to_encode, max_length=max_length, truncation=True
            )
        else:
            tokens = tokenizer.encode(text_to_encode)
    finally:
        # Restore the logging level even if tokenization fails
        logger.setLevel(log_level)

    # ================ Uncomment the line below to see the tokenized text ====================== #
    #print(tokens)

    return tokens
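
The save, silence, restore pattern used in `encode` can also be written as a context manager, which guarantees that the level is restored even if the wrapped code raises. This is a sketch of that alternative, not part of the original code; `quiet_logger` and the logger name `demo.tokenizer` are made up for the example.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logger(name: str, level: int = logging.ERROR):
    """Temporarily raise the level of the named logger, then restore it."""
    logger = logging.getLogger(name)
    previous = logger.level  # the logger's own level, not the inherited one
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)


demo = logging.getLogger("demo.tokenizer")
demo.setLevel(logging.WARNING)

with quiet_logger("demo.tokenizer"):
    demo.warning("this warning is suppressed")
    inside = demo.level

print(inside == logging.ERROR)        # True
print(demo.level == logging.WARNING)  # True: restored afterwards
```

Saving `logger.level` rather than the effective level means a logger that inherited its level from a parent goes back to inheriting it afterwards.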

In [164]:
def encode_special(
    string_or_list: Union[str, List[str]],
    tokenizer,
    special_token: Optional[SpecialToken] = None,
    separator_token_id: Optional[int] = None,
    eos_token_id: Optional[int] = None,
    preferred_length: int = 0,
    trim: Trim = Trim.end,
) -> Segment:
        """
        After encoding with the tokenizer, create a Segment and
        assign the special_token if specified.
        """
        #print("special token is ", special_token)
        #print("separator token id is ", separator_token_id)

        return Segment(
            encode(string_or_list, tokenizer, 1024),  # the tokenizer and max length are
                                                      # now passed explicitly
            separator=tokenizer.convert_tokens_to_ids(SpecialToken.separator)
            if separator_token_id is None
            else separator_token_id,
            eos=eos_token_id,
            segment_ids=[tokenizer.convert_tokens_to_ids(special_token)]
            if special_token
            else tuple(),
            preferred_length=preferred_length,
            trim=trim,
        )

In [165]:
def extract_string(
    field: str, mapping: Dict[str, Any], default: str = SpecialToken.missing.value
) -> str:
    """
    Extract the given string field, accounting for the potential that it is
    specified as None
    """
    # ============================================================= #
    #print("FLOW OF COMPUTATION 19 ")
    # ============================================================= #

    return mapping.get(field, default) or default
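
The `or default` at the end of `extract_string` is what makes it robust: a missing key, a value of None, and an empty string all fall back to the default. The snippet below repeats the one-line function so that it runs on its own; the `character` dictionary and its `nickname` key are toy data invented for the example.

```python
from typing import Any, Dict

MISSING = "<|MISSING|>"  # the value of SpecialToken.missing


def extract_string(field: str, mapping: Dict[str, Any], default: str = MISSING) -> str:
    """Same logic as above, repeated so the example is self-contained."""
    return mapping.get(field, default) or default


character = {"name": "Ava", "description": None, "nickname": ""}

print(extract_string("name", character))         # Ava
print(extract_string("description", character))  # <|MISSING|>  (value is None)
print(extract_string("nickname", character))     # <|MISSING|>  (empty string)
print(extract_string("age", character))          # <|MISSING|>  (key absent)
print(repr(extract_string("description", character, "")))  # ''
```

The last call mirrors how `process_entry` and `checksum_entry` ask for the entry text with an empty-string default instead of the missing token.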

# Dataclasses 

The code uses dataclasses. Once our data has been processed, it is stored in the following dataclasses.

1. **ProcessedStory** - The entire story is stored in this dataclass. Besides the `game_id`, it contains three things:
    - *characters*
    - *entries*
    - *establishment_entries*

    These serve as our main cards. The entries and establishment entries are stored as **EntryInfo** objects and the characters as **CharacterInfo** objects. Notice that each of the three is an IndexedDict.

2. **EntryInfo** - Container for entries and establishment entries.
3. **CharacterInfo** - Container for characters.

In [166]:
@dataclass
class CharacterInfo:
    """
    The processed character info
    """
    
    # ============================================================= #
    #print("Inside CharacterInfo")
    # ============================================================= #

    summary: Segment
    character_id: str
    checksum: int

    # This is a sorted list of entry ids written by the character to
    # allow easily looking up the previous entries for the character
    entry_ids: IndexedSet

In [167]:
@dataclass
class EntryInfo:
    """
    The processed entry info
    """


    entry_id: str
    character_id: str
    establishment_id: str
    checksum: int
    text: Segment
    summary: Segment


In [168]:
@dataclass
class ProcessedStory:
    """
    This defines the structure of a story after processing
    """


    
    game_id: str

    # A mapping of character id to character info
    characters: IndexedDict[CharacterInfo]

    # A mapping of entry id to entry info
    entries: IndexedDict[EntryInfo]

    # A mapping of entry id to establishment's entry info
    establishment_entries: IndexedDict[EntryInfo]


# Checksums 

Each card has an associated checksum, which allows selective reprocessing of the data: only data that has changed (as indicated by a changed checksum) needs to be reprocessed and re-encoded for training. Hence, every character and every entry carries a checksum computed from its ID, its own text, and the cards attached to it.

In [169]:
def checksum_card(card: Optional[Dict[str, Any]], checksum: int = 1) -> int:
    """
    Checksum the card.
    """
    if not card:
        return checksum

    for field in ("name", "description", "success_stakes", "failure_stakes"):
        checksum = zlib.adler32(
            extract_string(field, card).encode("utf-8"), checksum
        )

    return checksum

In [170]:
def checksum_cards(cards: List[Dict[str, Any]], checksum: int = 1) -> int:
    """
    Checksum a list of cards
    """
    for card in cards:
        checksum = checksum_card(card, checksum)

    return checksum


In [171]:
def checksum_character(character: Dict[str, Any], character_id: str) -> int:
    """
    Compute a checksum of a character
    """
    checksum = zlib.adler32(character_id.encode("utf-8"))
    for field in ("name", "description"):
        checksum = zlib.adler32(
            extract_string(field, character).encode("utf-8"), checksum
        )

    return checksum

In [172]:
def checksum_entry(entry: Dict[str, Any], entry_id: str) -> int:
    """
    Compute a checksum of an entry
    """
    checksum = zlib.adler32(entry_id.encode("utf-8"))
    entry_type = entry["format"]
    if entry_type == "move":
        checksum = checksum_card(entry.get("target_challenge_card"), checksum)
        checksum = checksum_cards(
            entry.get("cards_played_on_challenge", []), checksum
        )
    elif entry_type == "establishment":
        checksum = checksum_card(entry.get("place_card"), checksum)
    elif entry_type == "addition":
        checksum = checksum_cards(entry.get("challenge_cards", []), checksum)

    return zlib.adler32(
        extract_string("description", entry, "").encode("utf-8"), checksum
    )
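
All four functions rely on the same trick: `zlib.adler32` accepts a starting value, so the checksum of one field can be fed into the next call to build a single running checksum (the starting value of 1 used in `checksum_card` is also the default of `zlib.adler32`). The sketch below shows this in isolation. The helper `checksum_fields` and the two toy cards are hypothetical and only meant to illustrate change detection.

```python
import zlib
from typing import Any, Dict, Iterable


def checksum_fields(record: Dict[str, Any], fields: Iterable[str], checksum: int = 1) -> int:
    """Fold the given text fields of a record into one running Adler-32 checksum."""
    for field in fields:
        text = record.get(field) or ""
        checksum = zlib.adler32(text.encode("utf-8"), checksum)
    return checksum


card = {"name": "Stormy Seas", "description": "Waves crash over the deck."}
edited = {"name": "Stormy Seas", "description": "Waves crash over the bow."}

before = checksum_fields(card, ("name", "description"))
after = checksum_fields(edited, ("name", "description"))

print(before == checksum_fields(card, ("name", "description")))  # True: same data, same checksum
print(before != after)                                           # True: the edit is detected

# Chaining is equivalent to checksumming the concatenated bytes
print(zlib.adler32(b"ab") == zlib.adler32(b"b", zlib.adler32(b"a")))  # True
```

One consequence of the last line is that field boundaries are not part of the checksum: moving text from the end of one field to the start of the next would go unnoticed. For detecting ordinary edits to a story this is unlikely to matter.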

# Summaries of Cards

These are the functions that actually summarize each character, card, and entry and tokenize the result.

In [173]:
def summarize_character(character: Dict[str, Any], tokenizer) -> Segment:
    """
    Create the summary for a character
    """
    name_encoded = encode_special(
        extract_string("name", character),
        tokenizer,
        SpecialToken.from_string("name"),
        separator_token_id=tokenizer.bos_token_id,
    )
    
    #print("########################### NAME ENCODING DONE #########################")
    
    description_encoded = encode_special(
        extract_string("description", character),
        tokenizer,
        SpecialToken.from_string("description"),
    )
    
    #print("########################### DESCRIPTION ENCODING DONE #########################")

    
    encoded_fields = [name_encoded, description_encoded]
    
    return Segment(
        iter(encoded_fields),
        segment_ids=[tokenizer.convert_tokens_to_ids(SpecialToken.character)],
    )

In [174]:
def summarize_card(tokenizer, card: Optional[Dict[str, Any]]) -> Segment:
    """
    Create the summary of a card.

    If it's a challenge card, then it'll have "success_stakes" and
    "failure_stakes" as well.
    """
    if not card:
        return Segment()

    return Segment(
        iter(
            encode_special(
                string_or_list=extract_string(field, card),
                tokenizer=tokenizer,
                special_token=SpecialToken.from_string(field),
            )
            for field in ("name", "description", "success_stakes", "failure_stakes")
            if card.get(field)
        ),
        segment_ids=tuple(
            tokenizer.convert_tokens_to_ids(
                (SpecialToken.from_string(card["namespace"]),)
            ),
        ),
    )

In [175]:
def summarize_cards(tokenizer,cards: List[Dict[str, Any]]) -> Segment:
    """
    Create the summary of a list of cards
    """
    return Segment(iter(summarize_card(tokenizer,card) for card in cards))

In [176]:
def summarize_entry(tokenizer,entry: Dict[str, Any]) -> Segment:
    """
    Create the summary of an entry
    """
    summary = []
    entry_type = entry["format"]
    if entry_type == "move":
        challenge = summarize_card(tokenizer,entry.get("target_challenge_card"))
        if challenge:
            summary.append(challenge)

        cards = summarize_cards(tokenizer,entry.get("cards_played_on_challenge", []))
        if cards:
            summary.append(cards)
    elif entry_type == "establishment":
        place = summarize_card(tokenizer,entry.get("place_card"))
        if place:
            summary.append(place)
    elif entry_type == "addition":
        cards = summarize_cards(tokenizer, entry.get("challenge_cards", []))
        if cards:
            summary.append(cards)

    return Segment(
        summary,
        segment_ids=tuple(
            tokenizer.convert_tokens_to_ids(
                (SpecialToken.from_string(entry_type),)
            ),
        ),
    )

# The story details 

We now consider the main story processing, which is done by the **process_story** function. Its workflow is the following:
1. **Extract Scenes and Characters:** It starts by extracting the scenes and characters from the story dictionary. If these are missing or not in the correct format, the function returns the previously processed object (or None if there is none), effectively skipping processing.
2. **Initialize Character List:** A list of characters is initialized, starting with a default narrator entry. The narrator is always present in Storium stories but has no detailed summary (it has a checksum of 0, an empty entry_ids set, and an empty Segment as summary).

3. **Process Characters:** Iterate over each character in the characters list. Generate a character_id by prefixing the character_seq_id with `character:`, compute the character's checksum, and summarize the character unless an up-to-date version has already been processed.

4. **Process Scenes and Entries:** Iterate over each scene in scenes, and within each, iterate over its entries. For each entry, compute its checksum and determine whether it needs processing, based on whether it has changed from the previously processed version. Process the entry using process_entry, which tokenizes and structures the entry's text and associates it with the relevant character and scene information.

5. **Construct ProcessedStory Object:** Compile the processed data into a ProcessedStory object, containing the structured data for the entire story, including mappings of characters and entries.

The function relies on another function called **process_entry**, which processes a single entry of a story (such as a character's move) and encapsulates the processed data in an EntryInfo object.
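
The "process only what changed" logic of steps 3 and 4 can be sketched independently of Storium. In the example below, `process_incrementally`, the `Cache` alias and `fake_tokenize` are hypothetical names, and plain strings stand in for characters and entries; the point is only to show how a stored checksum decides between reusing and reprocessing.

```python
import zlib
from typing import Callable, Dict, List, Tuple

Cache = Dict[str, Tuple[int, str]]  # item id -> (checksum, processed result)


def process_incrementally(raw: Dict[str, str], cache: Cache, process: Callable[[str], str]) -> Cache:
    """Re-run process only for items whose checksum differs from the cached one."""
    result: Cache = {}
    for item_id, text in raw.items():
        # Seed the checksum with the id, then fold in the text
        checksum = zlib.adler32(text.encode("utf-8"), zlib.adler32(item_id.encode("utf-8")))
        cached = cache.get(item_id)
        if cached is None or cached[0] != checksum:
            cached = (checksum, process(text))  # new or changed: process it
        result[item_id] = cached
    return result


calls: List[str] = []


def fake_tokenize(text: str) -> str:
    calls.append(text)  # record every call so the reuse is visible
    return text.upper()


first = process_incrementally({"1": "a storm", "2": "a calm"}, {}, fake_tokenize)
second = process_incrementally({"1": "a storm", "2": "a gale"}, first, fake_tokenize)
print(calls)  # ['a storm', 'a calm', 'a gale']
```

On the second pass only the edited item is processed again, which is the saving that the `processed` argument of process_story is there to provide.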


In [177]:
# For now, the preferred entry length is set manually as a hyperparameter (256).

def process_entry(
    tokenizer,
    entry: Dict[str, Any],
    establishment_id: str,
    checksum: int,
    add_eos: bool = True,
    force: bool = False,
) -> Optional[EntryInfo]:
    """
    Process a character entry
    """
    ###  SETTING HYPERPARAMETERS ON MY OWN ############
    preferred_entry_length = 256 
    
    ############################################
    
    text = extract_string("description", entry, "")
    if not text and not force and entry.get("format") != "establishment":
        # Only modeling moves with written text, though make a special
        # exception for establishment entries. While they are currently
        # required to have text, it seems at some point there were games that
        # didn't have any text for the establishment entry, though it would still
        # have place cards.
        return None

    
    encoded_text = encode_special(
        text,
        tokenizer=tokenizer,
        special_token=SpecialToken.from_string(entry["format"]),
        preferred_length=preferred_entry_length,
        eos_token_id=tokenizer.eos_token_id if add_eos else None,
    )
    summary = summarize_entry(tokenizer,entry)
    if not summary:
        summary = encode_special(
            string_or_list=text,
            tokenizer=tokenizer,
            special_token=SpecialToken.from_string(entry["format"]),
            trim=Trim.start,  # Treat the end of the entry text as a summary
        )

    return EntryInfo(
        checksum=checksum,
        entry_id=entry["seq_id"],
        character_id=entry["role"],
        establishment_id=establishment_id,
        text=encoded_text,
        summary=summary,
    )

# Process_story

In [178]:
def process_story(story: Dict[str, Any], tokenizer, processed: Optional[ProcessedStory] = None) -> Optional[ProcessedStory]:
    """
    Process a single story, reusing previously processed data where it is unchanged
    """

    # HERE WE OBTAIN THE SCENES
    scenes = story.get("scenes")
    #print("the third scene is ", scenes[2]) # uncomment this to see the contents of a scene
    #print("the keys associated with this unprocessed scene are ", scenes[2].keys())

    # HERE WE OBTAIN ALL THE CHARACTERS
    characters = story.get("characters")
    #print("the total number of characters is ", len(characters))
    #print("a character is ", characters[1])
    #print(characters[1].keys())

    # If either scenes or characters are missing, or scenes is not a proper sequence,
    # we return previously processed data if available
    if not scenes or not characters or not isinstance(scenes, Sequence):
        return processed

    # Print only after the check above, since scenes may be None
    print("the total number of scenes are ", len(scenes))

    # We now create the character_list. The narrator is added first, with an empty
    # IndexedSet() of entry_ids and an empty Segment() as its summary, since the
    # narrator has no biography to summarize.

    character_list = [
        # Treat narrator as a character who is always present without a summary
        (
            "narrator",
            CharacterInfo(
                checksum=0,
                entry_ids=IndexedSet(),
                character_id="narrator",
                summary=Segment(),
            ),
        )
    ]
    
    #print(character_list)
    
    # =============================================================================== # 
    #                           Processing Character Entries 
    # ================================================================================#
    
    # We now Process each character in the story. We obtain the following:
    # - Their ID, their associated checksum, their summary which is tokenized
    # - Finally, we encapsulate all of it in the dataclass CharacterInfo. 
    for character in characters:
        character_id = character.get("character_seq_id")
        if not character_id:
            continue


        character_id = f"character:{character_id}"

        character_info = (
            processed.characters.get(character_id, None) if processed else None
        )
        

        # Compute the checksum for the character
        checksum = checksum_character(character, character_id)
        if not character_info or character_info.checksum != checksum:
            # Haven't processed this character before, so process it now
            character_info = CharacterInfo(
                checksum=checksum,
                entry_ids=IndexedSet(),
                character_id=character_id,
                summary=summarize_character(character,tokenizer),
            )
        
        
            

        character_list.append(
            (
                character_id,
                character_info,
            )
        )
        


    all_characters = IndexedDict(character_list)
    
    # =============================================================================== # 
    #                           Processing Scene Entries 
    # ================================================================================#    
    
    # Same as for characters: obtain the id, checksum and tokenized summaries, then
    # encapsulate them in the dataclass EntryInfo.
    
    entry_list: List[Tuple[str, EntryInfo]] = []
    establishment_list: List[Tuple[str, EntryInfo]] = []
    for scene in scenes:
        entries = scene.get("entries", [])
        if not entries or not isinstance(entries, Sequence):
            continue

        for entry in entries:
            entry_id = entry.get("seq_id", None)
            if entry_id is None:
            
                continue
                

            
            checksum = checksum_entry(entry, entry_id)
            

            entry_info = (
                processed.entries.get(entry_id, None) if processed else None
            )
            if not entry_info or entry_info.checksum != checksum:
                # Haven't processed this entry before, so process it now
                entry_info = process_entry(
                    tokenizer,
                    entry,
                    establishment_list[-1][0] if establishment_list else entry_id,
                    checksum,
                )
            if not entry_info:
                continue

            entry_list.append((entry_id, entry_info))
            entry_format = entry.get("format")
            if entry_format == "establishment":
                establishment_list.append((entry_id, entry_info))

            character_info = (
                all_characters[  # pylint:disable=unsubscriptable-object
                    entry["role"]
                ]
            )

            character_info.entry_ids.insert(entry_id)


    return ProcessedStory(
        game_id=story["game_pid"],
        entries=IndexedDict(entry_list),
        characters=all_characters,
        establishment_entries=IndexedDict(establishment_list),
    )
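
One detail of the loop above is easy to miss: the `establishment_id` handed to process_entry is taken from `establishment_list` before the current entry is appended to it. The sketch below replays just that bookkeeping on toy `(entry_id, format)` pairs. The function `assign_establishment_ids` and the ids are hypothetical, and the sketch ignores entries that process_entry would skip.

```python
from typing import Dict, List, Tuple


def assign_establishment_ids(entries: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each entry id to the establishment id it would be processed with."""
    establishment_ids: List[str] = []
    assigned: Dict[str, str] = {}
    for entry_id, entry_format in entries:
        # Same expression as in process_story: latest establishment so far, else the entry itself
        assigned[entry_id] = establishment_ids[-1] if establishment_ids else entry_id
        if entry_format == "establishment":
            establishment_ids.append(entry_id)
    return assigned


toy_entries = [
    ("e1", "establishment"),
    ("m1", "move"),
    ("e2", "establishment"),
    ("m2", "move"),
]
print(assign_establishment_ids(toy_entries))
# {'e1': 'e1', 'm1': 'e1', 'e2': 'e1', 'm2': 'e2'}
```

As I read it, this means a move is tied to the scene it belongs to, while an establishment entry other than the first is tied to the establishment of the preceding scene. Whether that is intended is worth checking against the original repository.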

# Preprocessing a single story 

With the lengthy preprocessing pipeline defined above, let us now look at a before-and-after of a preprocessed story.



In [179]:
dataset_path = r'C:\Users\AWCD\OneDrive\Desktop\CS438 Generative AI\Project\storium_2019_08_22'
# Build the path with os.path.join so that the separators are consistent
total_path = os.path.join(dataset_path, 'full_export', '5', '9', '591cca.json')

with open(total_path, 'r',encoding='utf-8') as file:
    story_data = json.load(file) 
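
For the "before" side of the comparison, a quick way to get a feel for a raw story is to count its entries by format. The helper below, `count_entry_formats`, is a hypothetical convenience function. It only relies on the `scenes`, `entries` and `format` keys that process_story itself reads, and it is shown here on a tiny hand-made story rather than on the real file.

```python
from collections import Counter
from typing import Any, Dict


def count_entry_formats(story: Dict[str, Any]) -> Counter:
    """Count the entries of a raw story by their "format" field."""
    counts: Counter = Counter()
    for scene in story.get("scenes") or []:
        for entry in scene.get("entries") or []:
            counts[entry.get("format")] += 1
    return counts


toy_story = {
    "scenes": [
        {"entries": [{"format": "establishment"}, {"format": "move"}, {"format": "move"}]},
        {"entries": [{"format": "establishment"}, {"format": "conclusion"}]},
        {"entries": None},  # scenes without entries are tolerated
    ]
}
print(count_entry_formats(toy_story))
# Counter({'establishment': 2, 'move': 2, 'conclusion': 1})
```

Calling `count_entry_formats(story_data)` on the loaded story should give the same kind of breakdown for the real data.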


In [180]:
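from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(type(tokenizer))

# Register the special tokens with the tokenizer. Without this step,
# convert_tokens_to_ids maps every SpecialToken to the same unknown-token id.
# (A model fine-tuned with this tokenizer needs its embeddings resized to match.)
tokenizer.add_special_tokens(
    {"additional_special_tokens": [token.value for token in SpecialToken]}
)

processed_story = process_story(story_data, tokenizer)  # as discussed, the tokenizer
                                                        # must now be passed explicitly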


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<class 'transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer'>
the total number of scenes are  54
name_field
description
[... the name_field/description pair repeats once per character, 20 pairs in total ...]
establishment
place
name_field
description
establishment
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
[... output continues in the same pattern, one line per `SpecialToken.from_string` call ...]
failure_stakes
addition
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
thing
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
addition
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
success_stakes
failure_stakes
addition
move
obstacle
name_field
descripti

move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_f

success_stakes
failure_stakes
addition
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
move
move
addition
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
su

obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
strength
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
move
move
move
m

obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
thing
name_field
description
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
person
name_field
description
success_stakes
failure_stakes
weaknes

conclusion
conclusion
establishment
establishment
establishment
addition
obstacle
name_field
description
success_stakes
failure_stakes
addition
addition
person
name_field
description
success_stakes
failure_stakes
addition
move
move
move
addition
obstacle
name_field
description
success_stakes
failure_stakes
addition
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
move
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
move
move
move
move
move
move
person
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
move
move
addition
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
succ

obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
addition
addition
addition
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
weakness
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
thing
name_field
description
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
strength
name_field
description
move
addition
addition
addition
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
addition
person
name_field
description
success_stakes
failure_stakes
addition
move
move
move
addition
addition
addition
move
person
name_field
description
success_stakes
failure_stakes
thing
name_fiel

obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
conclusion
conclusion
conclusion
establishment
place
name_field
description
establishment
addition
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
success_stakes
failure_stakes
addition
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
addition
obstacle
name_field
description
success_stakes
failure_stakes
obstacle
name_field
description
success_stakes
failure_stakes
addition
move
move
move
m

addition
addition
addition
move
move
move
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
weakness
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
subplot
name_field
description
move
move
obstacle
name_field
description
success_stakes
failure_stakes
goal
name_field
description
move
move
move
move
conclusion
conclusion
conclusion
establishment
establishment
establishment
move
move
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
strength
name_field
description
move
move
move
move
move
obstacle
name_field
description
success_stakes
failure_stakes
weakness
name_field
description
move
move
obstacle
name_field
descri

# The complete information flow:

Finally, let us look at a complete example of processing a story by tracing the function **process_story** for one particular story. In total, the story we are processing has:
1. 54 Scenes
2. 54 Characters 

Let's look at an *unprocessed scene*. It is essentially a dictionary of dictionaries. For illustration, I take scene 3 (index 2, since indexing starts at zero):
    `scenes[2]`.

I then get all the keys of this dictionary using `scenes[2].keys()`. The keys we obtain are the following:

['scene_id', 'act_number', 'chapter_number', 'scene_number', 'is_final', 'is_ended', 'cast_character_seq_ids', 'entries', 'comments'] 

These contain all the relevant information for scenes in their unprocessed form. 

Similarly, we obtain a single character using `characters[1]` and its keys using `characters[1].keys()`. The keys we obtain are:

['user_pid', 'character_seq_id', 'name', 'description', 'was_ever_submitted', 'submit_date', 'is_pregen_instance', 'is_host_character', 'has_ugc', 'has_contributed', 'via_open_invites', 'was_auto_approved', 'curscene_last_move_date', 'redacted', 'status', 'is_approved', 'is_retired', 'initial_hand_cards', 'current_hand_cards', 'image']
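
The inspection steps above can be sketched as follows. The `story` dictionary here is a tiny hand-made stand-in (an assumption, not real Storium data) that carries only a few of the keys listed above, and the top-level `"scenes"` and `"characters"` keys are likewise assumed for illustration.

```python
# Hypothetical miniature story; real stories carry many more keys per item.
story = {
    "scenes": [
        {"scene_id": "s0", "entries": []},
        {"scene_id": "s1", "entries": []},
        {"scene_id": "s2", "entries": [], "comments": []},
    ],
    "characters": [
        {"character_seq_id": "c0", "name": "A", "description": "first"},
        {"character_seq_id": "c1", "name": "B", "description": "second"},
    ],
}

scenes = story["scenes"]
characters = story["characters"]

# Scene 3 lives at index 2 because Python lists are zero-indexed.
scene_keys = list(scenes[2].keys())
# The keys belong to one character dictionary, not to the whole list.
character_keys = list(characters[1].keys())

print(len(scenes), len(characters))
print(scene_keys)
print(character_keys)
```

Calling `.keys()` on a single item rather than on the whole collection is what makes the call valid when `scenes` and `characters` are lists.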



The first thing we process is the **characters**. To do this, we create a `character_list` whose first element is the narrator. If you uncomment the print statement above, you will find it in the following format (notice that the checksum for the narrator is 0):

`[('narrator', CharacterInfo(summary=(), character_id='narrator', checksum=0, entry_ids=[]))]`

This follows the custom dataclass **CharacterInfo** that we have discussed several times. All characters are processed into the same format. For each character, we take the unique **character_id** from the character's keys and compute a **checksum** using the function **checksum_character**. The `entry_ids` field is created using IndexedSet. The most important part is **summary**, which is computed by **summarize_character** and is a segment of the following form:

`encoded_fields = [name_encoded, description_encoded]`
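
A minimal sketch of how such a list could be assembled is given below. The field names mirror the narrator entry printed above, but everything else is an assumption made for illustration: `toy_encode` stands in for the real tokenizer, `checksum_character_sketch` and `summarize_character_sketch` are hypothetical simplifications of the notebook's functions (Adler-32 is my choice, not necessarily the author's), and `entry_ids` is a plain list rather than an IndexedSet.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CharacterInfo:
    """Mirrors the fields visible in the printed narrator entry."""
    summary: Tuple[List[int], ...]
    character_id: str
    checksum: int
    entry_ids: List[str] = field(default_factory=list)  # IndexedSet in the notebook


def toy_encode(text: str) -> List[int]:
    """Stand-in for the tokenizer: one integer per whitespace-separated word."""
    return [zlib.crc32(word.encode("utf-8")) % 50257 for word in text.split()]


def checksum_character_sketch(character: Dict[str, str]) -> int:
    """Assumed checksum: Adler-32 over the name followed by the description."""
    payload = (character["name"] + character["description"]).encode("utf-8")
    return zlib.adler32(payload)


def summarize_character_sketch(character: Dict[str, str]) -> Tuple[List[int], ...]:
    """Build the [name_encoded, description_encoded] segment."""
    name_encoded = toy_encode(character["name"])
    description_encoded = toy_encode(character["description"])
    encoded_fields = [name_encoded, description_encoded]
    return tuple(encoded_fields)


# Hypothetical characters; two of them share a name, as in the story above.
characters = [
    {"character_seq_id": "c1", "name": "Alenteile (Lena)", "description": "Disarming and energetic."},
    {"character_seq_id": "c2", "name": "Anastasie d'Estouteville", "description": "Born in France, 1530."},
    {"character_seq_id": "c3", "name": "Anastasie d'Estouteville", "description": "A reincarnated character."},
]

# The narrator always comes first, with an empty summary and checksum 0.
character_list = [
    ("narrator", CharacterInfo(summary=(), character_id="narrator", checksum=0))
]
for character in characters:
    character_id = character["character_seq_id"]
    info = CharacterInfo(
        summary=summarize_character_sketch(character),
        character_id=character_id,
        checksum=checksum_character_sketch(character),
    )
    character_list.append((character_id, info))

print(len(character_list))
```

Because the name is encoded separately from the description, two characters with the same name end up with identical name encodings even when their descriptions differ.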

Let's see an example of encoding three characters (interestingly, there are some repeats). Note that we only consider **1024 tokens**. If that limit is exceeded, we apply the constraint optimization discussed earlier.
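
To make the 1024-token limit concrete, here is a deliberately simpler rule than the constraint optimization used in the notebook: keep the name whole and cut the description from the end. This is plain truncation, not the solver-based approach, and `fit_to_budget` is a hypothetical helper introduced only for this illustration.

```python
from typing import List, Tuple

MAX_TOKENS = 1024  # budget quoted in the text


def fit_to_budget(
    name_encoded: List[int],
    description_encoded: List[int],
    max_tokens: int = MAX_TOKENS,
) -> Tuple[List[int], List[int]]:
    """Keep the name first, then give the description whatever budget is left."""
    name_fit = name_encoded[:max_tokens]
    remaining = max_tokens - len(name_fit)
    return name_fit, description_encoded[:remaining]


# Dummy token ids: a 10-token name and a 2000-token description.
name = list(range(10))
description = list(range(2000))

name_fit, description_fit = fit_to_budget(name, description)
print(len(name_fit), len(description_fit))  # 10 1014
```

The constraint optimization plays the same role when several segments compete for the budget, deciding how much each one gives up instead of always cutting the last one.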

## Character One Pure Text:

1. **Name**: 
- Alenteile (Lena)

2. **Description**:
- At 5’6 with long auburn hair and grey eyes that can slip silver if they catch the light the right way, 25 year old Alenteile O’Shaughnessy can only be described as disarming. Her bright, energetic personality is warm and engaging with a hint of mischief, as if she’s keeping a secret that she’d love to share. In truth, she’s got two. 

    First, she is in service to a newly arrived vampire master, running his errands and making sure that his day to day needs are met without interruption. Second, she is his lover and share his hunger for young, pretty lovers and *all* that they offer. 

    This petite beauty looks innocent, but she knew at an early age that the sight, smell and taste of blood was a sexual turn on. She tried for a long time to hide that part of herself away, channeling her energies into gymnastics and her academic studies. She knew that to “go goth” would just bring this most taboo secret to the surface. Only once did she come close to letting that part of herself show.

    As a teen, she scratched a lover during sex, drawing blood. Unable to help herself, she attacked, her craving overwhelming her. She was able to inveigle her way out of the situation, but it taught her a powerful lesson – hold fast to your “humanity”, no matter the temptation.

    So she overachieved at “normal” so she could pass in a world that would revile her if they knew her truth. And she got really good at it. She honed her ability to charm folks, getting them to lower their defenses so she could sneak in, around, away if needs be. Then she met someone who not only couldn’t she beguile, but who saw her for who she truly was … Jahan AlBenir.

    She met Jahan at Diva's in her Junior year at Smith College in Western MA. He was exotic, exciting and, as she came to find out, enjoyed her most when she let go of her inhibitions … all of her inhibitions. Eventually, she left school and moved in with him, taking the roles of “daywalker”, housekeeper, lover, procurer. She shares his home, his bed, his lovers and his meals but never his blood, no matter how he pleads with her to join him. The childhood lesson has stayed with her as she adamantly refuses to become an immortal herself … however tempting. And Jahan is forever tempting.


## Character Two Pure Text:


1. **Name**
- Anastasie d'Estouteville
2. **Description**:
- Anastasie d'Estouteville was born in France, 1530, the youngest daughter of the Duke of Estouteville. As the daughter of a duke Anastasie enjoyed many privileges, at an early age she learnt what a woman must in a 16th century french society. She quickly gained impeccable social graces and etiquette and learnt later in her years how to navigate the intrigues of court. Unlike many women of her time her father recognized the potential within her and ensured that she receive a comprehensive education and employed later her within his court. 

    Anastasie knew from an early age that her family were somehow different. Despite their duke and duchess titles her parents were recluses. She rarely saw or spoke to either of her parents, and although it saddened her she was content to live a solitary life of learning. She understood that her parents were busy tending to the needs of their people.

    As the clock tuned 12 o clock on her twenty first birthday Anastasie's life as she knew it was no more and in its place was a sordid life of corruption and deceit. As she awoke to the sound of whispering she noted that she was not in her room nor her home, she was surrounded by veiled figures in a damp and somber room. It was in that room she first heard the tales of creatures that live in darkness and preyed upon the people of her land. It was in that room she was told that the tales were not works of fiction but of reality. And it was in that room she was told she would on her twenty forth birthday she would be initiated into their society of vampires of which her father,the Duke, headed.

    Anastasie was soon forced to become a vampire and in the three years that followed she was taught how to be the perfect predator. How to kill devoid of emotion, how to stalk and hunt pray at night and made to memorize the history and philosophy of their society. Anastasie within the period of a few short years morphed from a kind hearted child to a woman she no longer recognized. The society in many ways had accomplished their goals, she became ruthless, at the best amoral, cold and unforgiving. She was truly the perfect vampire, stronger and quicker than those far older than herself. Although there was one factor the society had not accounted for. Anastasie despised the society, and loathed her family for all they had done to her.

    On Anastasie's twenty fourth birthday  (now three years a vampire), when she was officially part of the society and permitted to leave her compound she made her way to her families castle and slaughtered every inhabitant in the dead of night, including her mother and father. After obtaining her revenge she fled the country in a futile attempt to avaid the societies far reaching influence. 

    Anastasie has spent every year since enjoying a hedonistic lifestyle while traveling the globe witnessing major historical events. She has seen first hand  the colonization of the Americas, the protestant reformation, the french revolution and countless other monumental events. 

    Anastasie is learned and wise, throughout the years she has retained and improved upon her diplomatic and court skills. A natural leader she is authoritative, inspiring and realistic paired with a cunning mind makes her a master at the game. She has risen high in the vampire hierarchy, being a powerful member of the former princes court. Anastasie has recently become a rich businesswoman as she dominates the lucrative underground vampire social life scene, being the proud owner of the most prominent and notorious vampire clubs and blood dens.

    To anyone looking in from the outside Anastasie's life would seem a dream, having a nearly infinite amount of wealth and power base, two beloved fledglings and a truly decadent lifestyle. However hundreds of years of living a ruthless and barbaric life have taken its tole upon her. Her past continues to haunt her every waking moment. A single question persistently reverberate within her mind.

    Have I over the decades become the very person I despise the most in this world?

    Determined to prove to herself that she is nothing like her father Anastasie has set out to create a family for herself; one which she will cherish and protect till the end of her undead life. One which she will willingly want to trade for  her current life of hedonism which over the decades has lost its luster.

    As Anastasie embarks upon a new journey so marks the beginning of her diminishing mental health. As her mind gradually shatters she becomes victim to sudden outbreaks of violence, blackouts and hallucinations...

    **Description**

    *Towering over most at 6'2 feet tall Anastasie intimidates most with but a glance of her hyacinth eyes, her unnaturally chiseled features make even the most beautiful of women feel inadequate within her presence. Seemingly endless strands of tight flaxen hair cascades in ringlets around Anastasie's ethereal ashen face down to her lower waist. For the few who have no knowledge of her reputation her innocent and trusting appearance can be used strategically to her advantage often leading to dire consequences for the individual that is fooled by her facade.*

    *As Anastasie is often either hosting or attending some sort of event she is seen by most in a variety of stylistic and elegant dresses of the most prestigious and exclusive designers; always accompanies by her old ruby necklace which she was gifted as a child. When she roams the streets at night she favors dark and intimidating apparel consisting most nights of leather and skintight denim*

    *Underneath the deceiving fabric lies evidence of a life long past. A variety of tattoos cover her slim yet curvaceous body in place of the scars she once wore before and during her three year period of transformstion with elaborate art describing the situations in which they were acquired and what lesson she may have learned from the experience. Anastasie reapplies the fading tattoos each week and has done so for nearly five hundred years. The only brand which is truly permanent is that of her societies initials positioned on the back of both of her wrists.*


## Character Three Pure Text:

1. **Name**
- Anastasie d'Estouteville
2. **Description**:

- *This character was originally conceived and played by JasmineMJ who retired from the game. The character has been modified and reincarnated for story continuity.*

    Anastasie d’Estouteville was born in France, 1530, the youngest daughter of the Duke of Estouteville. As the daughter of a duke Anastasie enjoyed many privileges, at an early age she learned what a woman must in a 16th century french society. Impeccable social graces and etiquette and how to  navigate the intrigues of court, and the management of a large household. Unlike many women of her time she also received a comprehensive education in literature and language.

    On her twenty fourth birthday Anastasie awoke to the sound of whispering. She was surrounded by veiled figures in a damp and somber room where she was told she would be initiated into a society of vampires and would become one of them. Her father the Duke had sacrificed her to their society for power and advantage, much as he would have sold her into a political marriage, but this was a much worse fate.

    There was no escape, and so Anastasie chose to excel so that she could someday take her revenge on the society she despised and the family she loathed for what they had done to her.

    She was taught to be the perfect predator. How to kill devoid of emotion, how to stalk and hunt prey at night, and she was made to learn all the customs, rituals, and history of their society.  She became ruthless, at the best amoral, cold and unforgiving.

    On the third anniversary of her Making, Anastasie calmly slaughtered every inhabitant of her family's castle in the dead of night, including her mother and father. She left the country, but knowing no other way of life, Anastasie was inexorably drawn back into a series of vampire societies, eventually finding a place of relative peace and stasis in Boston with the Prince.

    Through the years Stasie has improved upon her diplomatic and court skills. She has risen high in the vampire hierarchy, becoming a powerful member of the former Prince of Boston's court. She has also succeeded as a businesswoman, dominating the lucrative underground vampire social life scene as the owner of the most fashionable vampire clubs and blood dens.

    To anyone looking in from the outside Anastasie’s life would seem a dream, having a nearly infinite amount of wealth and power base, a beloved fledgling and a truly decadent lifestyle. However hundreds of years of living a ruthless and barbaric life have taken their toll. Her past haunts her. A single question persistently reverberate within her mind. Have I over the decades become the very person I despise the most in this world?

    *Description*

    At 6' feet tall Anastasie can intimidate most with a glance of her hyacinth eyes, her unnaturally chiseled features make even the most beautiful of women feel inadequate within her presence. Long strands of tight flaxen hair cascades in ringlets around Anastasie’s ethereal ashen face.

    Anastasie is often either hosting or attending some sort of event she is seen by most in a variety of elegant dresses from the most prestigious and exclusive designers; always accompanied by a ruby necklace which she was gifted as a child.


## Character One Tokenized:

1. **Name_Encoded**: 
- [2348, 21872, 576, 357, 43, 8107, 8]

2. **Description_Encoded**:
- [2953, 642, 447, 247, 21, 351, 890, 257, 549, 700, 4190, 290, 13791, 2951, 326, 460, 13819, 8465, 611, 484, 4929, 262, 1657, 262, 826, 835, 11, 1679, 614, 1468, 978, 21872, 576, 440, 447, 247, 2484, 1567, 1108, 88, 460, 691, 307, 3417, 355, 595, 18052, 13, 2332, 6016, 11, 26758, 8806, 318, 5814, 290, 11932, 351, 257, 9254, 286, 38625, 11, 355, 611, 673, 447, 247, 82, 5291, 257, 3200, 326, 673, 447, 247, 67, 1842, 284, 2648, 13, 554, 3872, 11, 673, 447, 247, 82, 1392, 734, 13, 220, 198, 198, 5962, 11, 673, 318, 287, 2139, 284, 257, 8308, 5284, 23952, 4958, 11, 2491, 465, 11454, 1746, 290, 1642, 1654, 326, 465, 1110, 284, 1110, 2476, 389, 1138, 1231, 41728, 13, 5498, 11, 673, 318, 465, 18854, 290, 2648, 465, 16460, 329, 1862, 11, 2495, 20175, 290, 1635, 439, 9, 326, 484, 2897, 13, 220, 198, 198, 1212, 4273, 578, 8737, 3073, 10218, 11, 475, 673, 2993, 379, 281, 1903, 2479, 326, 262, 6504, 11, 8508, 290, 6938, 286, 2910, 373, 257, 3206, 1210, 319, 13, 1375, 3088, 329, 257, 890, 640, 284, 7808, 326, 636, 286, 5223, 1497, 11, 6518, 278, 607, 27598, 656, 38581, 24232, 290, 607, 8233, 3640, 13, 1375, 2993, 326, 284, 564, 250, 2188, 308, 849, 447, 251, 561, 655, 2222, 428, 749, 35899, 3200, 284, 262, 4417, 13, 5514, 1752, 750, 673, 1282, 1969, 284, 9616, 326, 636, 286, 5223, 905, 13, 198, 198, 1722, 257, 6036, 11, 673, 37468, 257, 18854, 1141, 1714, 11, 8263, 2910, 13, 27319, 284, 1037, 5223, 11, 673, 7384, 11, 607, 34357, 9721, 607, 13, 1375, 373, 1498, 284, 287, 303, 328, 293, 607, 835, 503, 286, 262, 3074, 11, 475, 340, 7817, 607, 257, 3665, 11483, 784, 1745, 3049, 284, 534, 564, 250, 10734, 414, 447, 251, 11, 645, 2300, 262, 29062, 13, 198, 198, 2396, 673, 625, 620, 39591, 379, 564, 250, 11265, 447, 251, 523, 673, 714, 1208, 287, 257, 995, 326, 561, 2710, 576, 607, 611, 484, 2993, 607, 3872, 13, 843, 673, 1392, 1107, 922, 379, 340, 13, 1375, 3032, 276, 607, 2694, 284, 20024, 7974, 11, 1972, 606, 284, 2793, 511, 18370, 523, 673, 714, 20528, 287, 11, 1088, 11, 1497, 
611, 2476, 307, 13, 3244, 673, 1138, 2130, 508, 407, 691, 3521, 447, 247, 83, 673, 4123, 84, 576, 11, 475, 508, 2497, 607, 329, 508, 673, 4988, 373, 3926, 449, 19210, 978, 11696, 343, 13, 198, 198, 3347, 1138, 449, 19210, 379, 4777, 64, 338, 287, 607, 20000, 614, 379, 4176, 5535, 287, 4885, 8779, 13, 679, 373, 21036, 11, 7895, 290, 11, 355, 673, 1625, 284, 1064, 503, 11, 8359, 607, 749, 618, 673, 1309, 467, 286, 607, 11062, 1756, 3926, 477, 286, 607, 11062, 1756, 13, 16178, 11, 673, 1364, 1524, 290, 3888, 287, 351, 683, 11, 2263, 262, 9176, 286, 564, 250, 820, 20783, 447, 251, 11, 2156, 13884, 11, 18854, 11, 13834, 15051, 13, 1375, 7303, 465, 1363, 11, 465, 3996, 11, 465, 20175, 290, 465, 13840, 475, 1239, 465, 2910, 11, 645, 2300, 703, 339, 3339, 5643, 351, 607, 284, 4654, 683, 13, 383, 9963, 11483, 468, 9658, 351, 607, 355, 673, 23197, 3875, 17567, 284, 1716, 281, 26156, 5223, 3926, 2158, 29850, 13, 843, 449, 19210, 318, 8097, 29850, 13]

## Character Two Tokenized:

Character three has the same name, so its **Name_Encoded** is identical to the one below.

1. **Name_Encoded**:
- [2025, 459, 292, 494, 288, 6, 22362, 448, 1990, 8270]

2. **Description_Encoded**:
- [2025, 459, 292, 494, 288, 6, 22362, 448, 1990, 8270, 373, 4642, 287, 4881, 11, 1315, 1270, 11, 262, 18887, 4957, 286, 262, 11083, 286, 10062, 448, 1990, 8270, 13, 1081, 262, 4957, 286, 257, 288, 4649, 1052, 459, 292, 494, 8359, 867, 18850, 11, 379, 281, 1903, 2479, 673, 26338, 644, 257, 2415, 1276, 287, 257, 1467, 400, 4289, 48718, 3592, 13, 1375, 2952, 8618, 45707, 540, 1919, 1036, 2114, 290, 49455, 290, 26338, 1568, 287, 607, 812, 703, 284, 16500, 262, 13250, 947, 286, 2184, 13, 12101, 867, 1466, 286, 607, 640, 607, 2988, 8018, 262, 2785, 1626, 607, 290, 30169, 326, 673, 3328, 257, 9815, 3707, 290, 9322, 1568, 607, 1626, 465, 2184, 13, 220, 198, 198, 2025, 459, 292, 494, 2993, 422, 281, 1903, 2479, 326, 607, 1641, 547, 7599, 1180, 13, 7945, 511, 288, 4649, 290, 288, 1229, 33979, 8714, 607, 3397, 547, 302, 2527, 274, 13, 1375, 8365, 2497, 393, 5158, 284, 2035, 286, 607, 3397, 11, 290, 3584, 340, 40434, 607, 673, 373, 2695, 284, 2107, 257, 25565, 1204, 286, 4673, 13, 1375, 7247, 326, 607, 3397, 547, 8179, 44681, 284, 262, 2476, 286, 511, 661, 13, 198, 198, 1722, 262, 8801, 16524, 1105, 267, 8801, 319, 607, 8208, 717, 10955, 1052, 459, 292, 494, 338, 1204, 355, 673, 2993, 340, 373, 645, 517, 290, 287, 663, 1295, 373, 257, 264, 585, 312, 1204, 286, 9253, 290, 37268, 13, 1081, 673, 43363, 284, 262, 2128, 286, 48508, 673, 4367, 326, 673, 373, 407, 287, 607, 2119, 4249, 607, 1363, 11, 673, 373, 11191, 416, 45694, 5538, 287, 257, 21151, 290, 3870, 527, 2119, 13, 632, 373, 287, 326, 2119, 673, 717, 2982, 262, 19490, 286, 8109, 326, 2107, 287, 11854, 290, 15974, 276, 2402, 262, 661, 286, 607, 1956, 13, 632, 373, 287, 326, 2119, 673, 373, 1297, 326, 262, 19490, 547, 407, 2499, 286, 10165, 475, 286, 3950, 13, 843, 340, 373, 287, 326, 2119, 673, 373, 1297, 673, 561, 319, 607, 8208, 6071, 10955, 673, 561, 307, 16862, 656, 511, 3592, 286, 35299, 286, 543, 607, 2988, 11, 1169, 11083, 11, 9153, 13, 198, 198, 2025, 459, 292, 494, 373, 2582, 4137, 284, 1716, 257, 23952, 290, 
287, 262, 1115, 812, 326, 3940, 673, 373, 7817, 703, 284, 307, 262, 2818, 30135, 13, 1374, 284, 1494, 34065, 286, 9942, 11, 703, 284, 31297, 290, 12601, 12472, 379, 1755, 290, 925, 284, 16181, 1096, 262, 2106, 290, 8876, 286, 511, 3592, 13, 1052, 459, 292, 494, 1626, 262, 2278, 286, 257, 1178, 1790, 812, 49976, 422, 257, 1611, 2612, 276, 1200, 284, 257, 2415, 673, 645, 2392, 8018, 13, 383, 3592, 287, 867, 2842, 550, 13013, 511, 4661, 11, 673, 2627, 29541, 11, 379, 262, 1266, 716, 6864, 11, 4692, 290, 28371, 13992, 13, 1375, 373, 4988, 262, 2818, 23952, 11, 7387, 290, 20061, 621, 883, 1290, 4697, 621, 5223, 13, 4900, 612, 373, 530, 5766, 262, 3592, 550, 407, 17830, 329, 13, 1052, 459, 292, 494, 44990, 262, 3592, 11, 290, 2376, 35932, 607, 1641, 329, 477, 484, 550, 1760, 284, 607, 13, 198, 198, 2202, 1052, 459, 292, 494, 338, 8208, 5544, 10955, 220, 357, 2197, 1115, 812, 257, 23952, 828, 618, 673, 373, 8720, 636, 286, 262, 3592, 290, 10431, 284, 2666, 607, 13061, 673, 925, 607, 835, 284, 607, 4172, 16669, 290, 33850, 790, 14527, 415, 287, 262, 2636, 286, 1755, 11, 1390, 607, 2802, 290, 2988, 13, 2293, 16727, 607, 15827, 673, 11468, 262, 1499, 287, 257, 35322, 2230, 284, 1196, 1698, 262, 14515, 1290, 8978, 4588, 13, 220, 198, 198, 2025, 459, 292, 494, 468, 3377, 790, 614, 1201, 13226, 257, 339, 9099, 2569, 12263, 981, 11300, 262, 13342, 31121, 1688, 6754, 2995, 13, 1375, 468, 1775, 717, 1021, 220, 262, 40337, 286, 262, 25733, 11, 262, 5402, 415, 302, 1161, 11, 262, 48718, 5854, 290, 12925, 584, 36364, 2995, 13, 220, 198, 198, 2025, 459, 292, 494, 318, 4499, 290, 10787, 11, 3690, 262, 812, 673, 468, 17383, 290, 6596, 2402, 607, 13093, 290, 2184, 4678, 13, 317, 3288, 3554, 673, 318, 32042, 11, 20886, 290, 12653, 20312, 351, 257, 34218, 2000, 1838, 607, 257, 4958, 379, 262, 983, 13, 1375, 468, 17450, 1029, 287, 262, 23952, 18911, 11, 852, 257, 3665, 2888, 286, 262, 1966, 42676, 2184, 13, 1052, 459, 292, 494, 468, 2904, 1716, 257, 5527, 1597, 8580, 355, 673, 38777, 262, 
22958, 11447, 23952, 1919, 1204, 3715, 11, 852, 262, 6613, 4870, 286, 262, 749, 9208, 290, 18192, 23952, 9784, 290, 2910, 29509, 13, 198, 198, 2514, 2687, 2045, 287, 422, 262, 2354, 1052, 459, 292, 494, 338, 1204, 561, 1283, 257, 4320, 11, 1719, 257, 3016, 15541, 2033, 286, 5129, 290, 1176, 2779, 11, 734, 14142, 11468, 4743, 654, 290, 257, 4988, 46734, 298, 12263, 13, 2102, 5179, 286, 812, 286, 2877, 257, 29541, 290, 48176, 1204, 423, 2077, 663, 284, 293, 2402, 607, 13, 2332, 1613, 4477, 284, 36334, 607, 790, 23137, 2589, 13, 317, 2060, 1808, 21160, 1473, 48134, 378, 1626, 607, 2000, 13, 198, 198, 11980, 314, 625, 262, 4647, 1716, 262, 845, 1048, 314, 44351, 262, 749, 287, 428, 995, 30, 198, 198, 35, 23444, 284, 5879, 284, 5223, 326, 673, 318, 2147, 588, 607, 2988, 1052, 459, 292, 494, 468, 900, 503, 284, 2251, 257, 1641, 329, 5223, 26, 530, 543, 673, 481, 48303, 290, 1805, 10597, 262, 886, 286, 607, 26805, 1204, 13, 1881, 543, 673, 481, 30981, 765, 284, 3292, 329, 220, 607, 1459, 1204, 286, 339, 9099, 1042, 543, 625, 262, 4647, 468, 2626, 663, 300, 5819, 13, 198, 198, 1722, 1052, 459, 292, 494, 4072, 5558, 2402, 257, 649, 7002, 523, 8849, 262, 3726, 286, 607, 35197, 5110, 1535, 13, 1081, 607, 2000, 11835, 427, 34387, 673, 4329, 3117, 284, 4802, 35198, 286, 3685, 11, 2042, 5269, 290, 40371, 986, 198, 198, 1174, 11828, 1174, 198, 198, 9, 51, 789, 278, 625, 749, 379, 718, 6, 17, 3625, 7331, 1052, 459, 292, 494, 12895, 689, 749, 351, 475, 257, 16086, 286, 607, 2537, 330, 9304, 2951, 11, 607, 41231, 7436, 442, 271, 18449, 3033, 787, 772, 262, 749, 4950, 286, 1466, 1254, 20577, 1626, 607, 4931]


The special tokens associated with these fields are **<|NAME_FIELD|>** and **<|DESCRIPTION|>**. They are used to create segment_ids once the fields have been tokenized. The separator token id is **50256**. The character list consists of entries like this one, with the information for each character encapsulated in a CharacterInfo datatype. A unique segment ID is associated with each CharacterInfo.
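
To make the idea concrete, here is a minimal sketch of how an encoded name and description could be wrapped with their special tokens and turned into segment ids. It is not the actual preprocessing code: the helper names `encode_field` and `encode_character`, the placeholder ids in `SPECIAL_IDS`, the toy name ids, and the choice to label every position with the id of its field's special token are all assumptions made for illustration. Only the separator id 50256 and the first few description ids come from the output above.

```python
from typing import Dict, List, Tuple

SEPARATOR_ID = 50256  # separator token id quoted in the text above

# Placeholder ids (assumption): in practice these come from the tokenizer
# after the special tokens have been registered.
SPECIAL_IDS: Dict[str, int] = {
    "<|NAME_FIELD|>": 50257,
    "<|DESCRIPTION|>": 50258,
}


def encode_field(special: str, token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Wrap one encoded field as [special] + tokens + [separator].

    Every position in the field is labelled with the special token's id,
    which is one simple way to build segment ids (an assumption here).
    """
    special_id = SPECIAL_IDS[special]
    tokens = [special_id] + token_ids + [SEPARATOR_ID]
    segment_ids = [special_id] * len(tokens)
    return tokens, segment_ids


def encode_character(name_ids: List[int], description_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate the name and description fields of one character."""
    tokens: List[int] = []
    segment_ids: List[int] = []
    fields = (("<|NAME_FIELD|>", name_ids), ("<|DESCRIPTION|>", description_ids))
    for special, ids in fields:
        field_tokens, field_segments = encode_field(special, ids)
        tokens += field_tokens
        segment_ids += field_segments
    return tokens, segment_ids


name_ids = [11, 12]  # toy ids, not a real encoded name
description_ids = [2025, 459, 292, 494, 288, 6]  # first ids of Description_Encoded above
tokens, segment_ids = encode_character(name_ids, description_ids)
print(tokens)
print(segment_ids)
```

The two lists always have the same length, so each token position carries a label saying which field it belongs to.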


The scenes follow a similar structure, so I will not repeat the walkthrough for them. One thing I would like to mention (and if anyone wants to explore it in detail I will be overjoyed, since it will help our understanding) is that the **special_tokens** for characters are just name_field and description, whereas the special tokens for stories are more diverse and include:

- addition
- person
- name_field
- description
- success_stakes
- failure_stakes
- weakness
- name_field
- description
- move
- conclusion
- establishment
- place
- obstacle
- subplot

amongst others. 
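
As a rough sketch of how such a list could be turned into special tokens, the snippet below formats each field name in the same `<|UPPER_CASE|>` style as **<|NAME_FIELD|>** and **<|DESCRIPTION|>**, drops repeated names while keeping their order, and assigns ids after GPT-2's base vocabulary. The helper `to_special_token` and the id assignment are assumptions for illustration. A real tokenizer assigns the ids itself when the tokens are registered.

```python
from typing import Dict, List

GPT2_VOCAB_SIZE = 50257  # base GPT-2 ids run from 0 to 50256

# Field names as listed above (the list is not exhaustive).
field_names: List[str] = [
    "addition", "person", "name_field", "description",
    "success_stakes", "failure_stakes", "weakness",
    "name_field", "description", "move", "conclusion",
    "establishment", "place", "obstacle", "subplot",
]


def to_special_token(name: str) -> str:
    """Format a field name like the tokens shown earlier, e.g. <|NAME_FIELD|>."""
    return f"<|{name.upper()}|>"


# dict.fromkeys removes repeats but keeps first-seen order,
# similar in spirit to the IndexedSet described earlier.
unique_tokens = list(dict.fromkeys(to_special_token(n) for n in field_names))

# Hypothetical id assignment: new tokens are appended after the base vocabulary.
special_token_ids: Dict[str, int] = {
    token: GPT2_VOCAB_SIZE + offset for offset, token in enumerate(unique_tokens)
}

print(len(unique_tokens))  # 13
print(special_token_ids["<|NAME_FIELD|>"])  # 50259
```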

We can now move on to fine-tuning. 





In [66]:
from transformers import GPT2Tokenizer

text = "This is a sequence"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(type(tokenizer))
# Pass truncation=True explicitly; max_length alone triggers a truncation warning.
x = tokenizer.encode(text, max_length=2, truncation=True)

print(len(x))  # 2


<class 'transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer'>
2
