
# Identifying authors of sentences

**Goal: Identify authors (writer, GPT-3, or both) of sentences in the final texts**

Steps
0. Preparation: Download and read CoAuthor
1. Apply operations
2. Classify authors
3. Populate `currentDoc` for all events (optional)

## 0. Preparation: Download and read CoAuthor

If you are not familiar with this step, please go through the steps in [1. Getting started](https://colab.research.google.com/drive/1nUGXP9l_jelbB4X65J0ivUvLgQz1RK1C?usp=sharing).

In [None]:
!wget https://cs.stanford.edu/~minalee/zip/chi2022-coauthor-v1.0.zip
!unzip -q chi2022-coauthor-v1.0.zip
!rm chi2022-coauthor-v1.0.zip

--2022-03-09 06:59:07--  https://cs.stanford.edu/~minalee/zip/chi2022-coauthor-v1.0.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49956179 (48M) [application/zip]
Saving to: ‘chi2022-coauthor-v1.0.zip’


2022-03-09 06:59:09 (40.9 MB/s) - ‘chi2022-coauthor-v1.0.zip’ saved [49956179/49956179]

replace coauthor-v1.0/e0435f4cf6fc435c872ffc5b66b66b0c.jsonl? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [None]:
import os
import json


def find_writing_sessions(dataset_dir):
    paths = [
        os.path.join(dataset_dir, path)
        for path in os.listdir(dataset_dir)
        if path.endswith('.jsonl')  # Each writing session is one JSONL file
    ]
    return paths


def read_writing_session(path):
    events = []
    with open(path, 'r') as f:
        for event in f:
            events.append(json.loads(event))
    return events


dataset_dir = './coauthor-v1.0'
paths = find_writing_sessions(dataset_dir)
events = read_writing_session(paths[0])

In [None]:
events = read_writing_session(paths[1])

## 1. Apply operations

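Each text-editing event stores its change as a list of Quill operations under `textDelta['ops']`. Replaying these operations in order reconstructs the document. There are three kinds of operations: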

* Retain: Keep the next number of characters, without modification
* Insert: Insert the specified content at the current location
* Delete: Delete the next number of characters

For more detail, please refer to [the Quill documentation](https://quilljs.com/docs/delta/#changes).
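
To make the three operations concrete, here is a minimal sketch that applies a list of operations to a plain string. The helper name `apply_delta` and the toy input are assumptions made for illustration; the sketch handles only string inserts and does not track authorship, which the notebook's own `apply_ops` does.

```python
def apply_delta(doc, ops):
    """Apply Quill-style ops to a plain string (illustrative helper)."""
    out = []
    pos = 0  # Cursor into the original document
    for op in ops:
        if 'retain' in op:
            # Copy the next characters unchanged
            out.append(doc[pos:pos + op['retain']])
            pos += op['retain']
        elif 'insert' in op:
            # Add new content at the current location
            out.append(op['insert'])
        elif 'delete' in op:
            # Skip the next characters of the original document
            pos += op['delete']
    out.append(doc[pos:])  # Whatever is left is retained implicitly
    return ''.join(out)


ops = [{'retain': 6}, {'delete': 5}, {'insert': 'Quill'}]
result = apply_delta('Hello world!', ops)
print(result)  # Hello Quill!
```

Note that the trailing `!` survives even though no operation mentions it: characters after the last operation are kept as they are.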

In [None]:
def apply_ops(doc, mask, ops, source):
    original_doc = doc
    original_mask = mask

    new_doc = ''
    new_mask = ''
    for i, op in enumerate(ops):

        # Handle retain operation
        if 'retain' in op:
            num_char = op['retain']

            retain_doc = original_doc[:num_char]
            retain_mask = original_mask[:num_char]

            original_doc = original_doc[num_char:]
            original_mask = original_mask[num_char:]

            new_doc = new_doc + retain_doc
            new_mask = new_mask + retain_mask

        # Handle insert operation
        elif 'insert' in op:
            insert_doc = op['insert']

            if isinstance(insert_doc, dict):
                # Non-text embeds (e.g., images) have no characters to attribute
                if 'image' in insert_doc:
                    print('Skipping invalid object insertion (image)')
                else:
                    # Ignore other invalid insertions
                    # Debug if necessary
                    print('Ignore invalid insertions:', op)
            else:
                # Label every inserted character with its source
                label = 'A' if source == 'api' else 'U'  # API or User
                new_doc = new_doc + insert_doc
                new_mask = new_mask + label * len(insert_doc)

        # Handle delete operation
        elif 'delete' in op:
            num_char = op['delete']

            if original_doc:
                original_doc = original_doc[num_char:]
                original_mask = original_mask[num_char:]
            else:
                new_doc = new_doc[:-num_char]
                new_mask = new_mask[:-num_char]

        else:
            # Ignore other operations
            # Debug if necessary
            print('Ignore other operations:', op)
            pass

    final_doc = new_doc + original_doc
    final_mask = new_mask + original_mask
    return final_doc, final_mask

In [None]:
def get_text_and_mask(events, event_id, remove_prompt=True):
    prompt = events[0]['currentDoc'].strip()

    text = prompt
    mask = 'P' * len(prompt)  # Prompt
    for event in events[:event_id]:
        if 'ops' not in event['textDelta']:
            continue
        ops = event['textDelta']['ops']
        source = event['eventSource']
        text, mask = apply_ops(text, mask, ops, source)

    if remove_prompt:
        if 'P' not in mask:
            print('=' * 80)
            print('Could not find the prompt in the final text')
            print('-' * 80)
            print('Prompt:', prompt)
            print('-' * 80)
            print('Final text:', text)
        else:
            end_index = mask.rindex('P')
            text = text[end_index + 1:]
            mask = mask[end_index + 1:]

    return text, mask

For each character in the text, there is a corresponding `P`, `A`, or `U` character in the mask.
`P` indicates that the character is part of the original prompt, `A` indicates that it was written by the API, and `U` indicates that it was written by the user.
Therefore, a text and its corresponding mask always have the same length.
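
Because the mask has one label per character, simple character-level statistics fall out of it directly. The sketch below computes the share of characters per label; the helper name `author_shares` and the toy text and mask are made up for illustration and are not part of CoAuthor.

```python
from collections import Counter


def author_shares(mask):
    """Return the fraction of characters carrying each label (P, A, U)."""
    counts = Counter(mask)
    total = len(mask)
    # Labels that never occur are left out of the result
    return {label: counts[label] / total for label in 'PAU' if counts[label]}


toy_text = 'Hi. Yo! ok'
toy_mask = 'PPPAAAAUUU'
assert len(toy_text) == len(toy_mask)  # One label per character

shares = author_shares(toy_mask)
print(shares)
```

In this toy mask, three of the ten characters come from the prompt, four from the API, and three from the user.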

Print the text and mask event by event to see how they evolve.
At the end of the sentence, the user inserts `\n` (event 4), `\n` (event 5), and `M` (event 6). Each insertion appends a `U` to the mask to mark the character as written by the user, so by event 6 the mask ends in `UUU`.

In [None]:
for i in range(7):
    text, mask = get_text_and_mask(events, i, remove_prompt=False)
    print(i, events[i]['eventName'])  # Event ID and name
    print('Text:', text)  # Current text
    print('Mask:', mask)  # Current mask
    print('-' * 60)

0 system-initialize
Text: All of the "#1 Dad" mugs in the world change to show the actual ranking of Dads suddenly.
Mask: PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
------------------------------------------------------------
1 cursor-backward
Text: All of the "#1 Dad" mugs in the world change to show the actual ranking of Dads suddenly.
Mask: PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
------------------------------------------------------------
2 cursor-forward
Text: All of the "#1 Dad" mugs in the world change to show the actual ranking of Dads suddenly.
Mask: PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
------------------------------------------------------------
3 text-insert
Text: All of the "#1 Dad" mugs in the world change to show the actual ranking of Dads suddenly.
Mask: PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP

To get the mask for the final text:

In [None]:
text, mask = get_text_and_mask(events, len(events), remove_prompt=False)
print(text)
print(mask)

All of the "#1 Dad" mugs in the world change to show the actual ranking of Dads suddenly.

Detective David Tapp, who is investigating the disappearance of Marybeth's husband, Joe, has a flashback to his own childhood. It was Tapp's own dad, who was drunk, that was responsible for an accident that left the boy with a facial scar. He wonders what is the old man's mug says. He bets it's very low in the rankings. Marybeth says that Joe was also always drunk and probably crashed his car while driving while intoxicated.

But here's the thing. Any mugs that owners were already dead are now blank. Joe's mug isn't blank at all. What's interesting, as Detective Tapp has figured out, "he's number one!" was written on it. "So Joe must still be alive" Tapp notes, and he runs out of the room to look for him.

"But how can be a drunk husband be the number 1 dad?" Detective Tapp contemplated. "It just doesn't make sense. How are the rankings decided?" There were thoughts running on Detective Tapp's mi

In [None]:
len(text), len(mask)

(1911, 1911)
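
A mask of this length is hard to read as a raw string. One possible way to inspect it is to collapse it into runs of identical labels. This is a sketch using `itertools.groupby`; the helper name `mask_runs` and the short example mask are assumptions for illustration.

```python
from itertools import groupby


def mask_runs(mask):
    """Collapse a mask into (label, run_length) pairs, in document order."""
    return [(label, len(list(group))) for label, group in groupby(mask)]


runs = mask_runs('PPPPUUAAAAAU')
print(runs)  # [('P', 4), ('U', 2), ('A', 5), ('U', 1)]
```

The run lengths always sum to the length of the mask, so the same check as `len(text), len(mask)` still applies.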

## 2. Classify authors

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import collections
from nltk.tokenize import sent_tokenize


def identify_author(mask):
    if 'P' in mask:
        return 'prompt'
    elif 'U' in mask and 'A' in mask:
        return 'user_and_api'
    elif 'U' in mask and 'A' not in mask:
        return 'user'
    elif 'U' not in mask and 'A' in mask:
        return 'api'
    else:
        raise RuntimeError(f'Could not identify author for this mask: {mask}')


def classify_sentences_by_author(text, mask):
    sentences_by_author = collections.defaultdict(list)
    search_start = 0  # Search forward only, so a repeated sentence maps to its own span
    for sentence_id, sentence in enumerate(sent_tokenize(text.strip())):
        index = text.find(sentence, search_start)
        if index == -1:
            print(f'Could not find sentence in text: {sentence}')
            continue
        search_start = index + len(sentence)
        sentence_mask = mask[index:index + len(sentence)]
        author = identify_author(sentence_mask)
        sentences_by_author[author].append({
            'sentence_id': sentence_id,
            'sentence_mask': sentence_mask,
            'sentence_author': author,
            'sentence_text': sentence,
        })
    return sentences_by_author

In [None]:
text, mask = get_text_and_mask(events, len(events), remove_prompt=True)  # Drop the prompt so that no sentence is labeled 'prompt'
sentences_by_author = classify_sentences_by_author(text, mask)

In [None]:
sentences_by_author.keys()

dict_keys(['user_and_api', 'user', 'api'])
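
The grouped result can be summarized or flattened back into document order. The sketch below assumes a dictionary shaped like `sentences_by_author`; the toy data and the helper names `count_sentences` and `in_document_order` are hypothetical and only illustrate the structure.

```python
# Toy data shaped like the output of classify_sentences_by_author
toy_sentences_by_author = {
    'user': [
        {'sentence_id': 0, 'sentence_author': 'user', 'sentence_text': 'I wrote this.'},
        {'sentence_id': 2, 'sentence_author': 'user', 'sentence_text': 'And this.'},
    ],
    'api': [
        {'sentence_id': 1, 'sentence_author': 'api', 'sentence_text': 'The model wrote this.'},
    ],
}


def count_sentences(sentences_by_author):
    """Number of sentences attributed to each author."""
    return {author: len(sentences) for author, sentences in sentences_by_author.items()}


def in_document_order(sentences_by_author):
    """Flatten the groups and restore the original sentence order."""
    merged = [s for sentences in sentences_by_author.values() for s in sentences]
    return sorted(merged, key=lambda s: s['sentence_id'])


counts = count_sentences(toy_sentences_by_author)
ordered = in_document_order(toy_sentences_by_author)
print(counts)
print([s['sentence_author'] for s in ordered])
```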

In [None]:
sentences_by_author

defaultdict(list,
            {'api': [{'sentence_author': 'api',
               'sentence_id': 14,
               'sentence_mask': 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA',
               'sentence_text': 'How are the rankings decided?"'},
              {'sentence_author': 'api',
               'sentence_id': 21,
               'sentence_mask': 'AAAAAAAAAAAAAAAAAAAAA',
               'sentence_text': 'Detective Tapp asked.'}],
             'user': [{'sentence_author': 'user',
               'sentence_id': 2,
               'sentence_mask': 'UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU',
               'sentence_text': "He wonders what is the old man's mug says."},
              {'sentence_author': 'user',
               'sentence_id': 3,
               'sentence_mask': 'UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU',
               'sentence_text': "He bets it's very low in the rankings."},
              {'sentence_author': 'user',
               'sentence_id': 5,
               'sentence_mask': 'UUUU

## 3. Populate `currentDoc` for all events (optional)

You might have noticed that the `currentDoc` field is empty for every event except the `system-initialize` event. This is intentional: saving `currentDoc` for every event is redundant and quickly results in a huge file. However, having the full document available at each event can be convenient at times.

In order to populate `currentDoc` for every event, do the following:

In [None]:
events[0], events[1]

({'currentCursor': 257,
  'currentDoc': 'Following World War III, all the nations of the world agreed to 50 years of strict isolation from one another in order to prevent additional conflicts. 50 years later, the United States comes out of exile, only to learn that no one else went into isolation.\n',
  'currentFrequencyPenalty': '1',
  'currentHoverIndex': '',
  'currentMaxToken': '30',
  'currentN': '5',
  'currentPresencePenalty': '0',
  'currentSuggestionIndex': 0,
  'currentSuggestions': [],
  'currentTemperature': '0.75',
  'currentTopP': '1',
  'cursorRange': '',
  'eventName': 'system-initialize',
  'eventNum': 0,
  'eventSource': 'api',
  'eventTimestamp': 1630521397737,
  'textDelta': ''},
 {'currentCursor': 0,
  'currentDoc': '',
  'currentFrequencyPenalty': '1',
  'currentHoverIndex': '',
  'currentMaxToken': '30',
  'currentN': '5',
  'currentPresencePenalty': '0',
  'currentSuggestionIndex': 0,
  'currentSuggestions': [],
  'currentTemperature': '0.75',
  'currentTopP': '

In [None]:
import copy


def populate_currentdoc(events):
    prompt = events[0]['currentDoc'].strip()

    text = prompt
    mask = 'P' * len(prompt)  # Prompt

    events_with_currentdoc = copy.deepcopy(events)  # Leave the input untouched
    for i, event in enumerate(events):
        if 'ops' in event['textDelta']:
            ops = event['textDelta']['ops']
            source = event['eventSource']
            text, mask = apply_ops(text, mask, ops, source)
        # Strings are immutable, so storing the reference is safe
        events_with_currentdoc[i]['currentDoc'] = text

    return events_with_currentdoc

In [None]:
events_with_currentdoc = populate_currentdoc(events)
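
If you want to keep the populated events, one option is to write them back in the same one-event-per-line JSONL layout that `read_writing_session` reads. This is a minimal sketch: the helper name `write_writing_session`, the output file name, and the toy events are assumptions, not part of the CoAuthor release.

```python
import json
import os
import tempfile


def write_writing_session(events, path):
    """Write one JSON-encoded event per line (hypothetical helper)."""
    with open(path, 'w') as f:
        for event in events:
            f.write(json.dumps(event) + '\n')


# Toy events standing in for events_with_currentdoc
toy_events = [
    {'eventNum': 0, 'currentDoc': 'Prompt.'},
    {'eventNum': 1, 'currentDoc': 'Prompt. Hi'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'session_with_currentdoc.jsonl')
    write_writing_session(toy_events, out_path)
    # Read the file back line by line to check the round trip
    with open(out_path) as f:
        reloaded = [json.loads(line) for line in f]

print(reloaded == toy_events)
```

Keep in mind that the resulting file can be much larger than the original, since every event then carries the full document.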

In [None]:
events_with_currentdoc[0], events_with_currentdoc[1]

({'currentCursor': 89,
  'currentDoc': 'All of the "#1 Dad" mugs in the world change to show the actual ranking of Dads suddenly.',
  'currentFrequencyPenalty': '1',
  'currentHoverIndex': '',
  'currentMaxToken': '30',
  'currentN': '5',
  'currentPresencePenalty': '0',
  'currentSuggestionIndex': 0,
  'currentSuggestions': [],
  'currentTemperature': '0.75',
  'currentTopP': '1',
  'cursorRange': '',
  'eventName': 'system-initialize',
  'eventNum': 0,
  'eventSource': 'api',
  'eventTimestamp': 1630595640315,
  'textDelta': ''},
 {'currentCursor': 73,
  'currentDoc': 'All of the "#1 Dad" mugs in the world change to show the actual ranking of Dads suddenly.',
  'currentFrequencyPenalty': '1',
  'currentHoverIndex': '',
  'currentMaxToken': '30',
  'currentN': '5',
  'currentPresencePenalty': '0',
  'currentSuggestionIndex': 0,
  'currentSuggestions': [],
  'currentTemperature': '0.75',
  'currentTopP': '1',
  'cursorRange': {'index': 73, 'length': 0},
  'eventName': 'cursor-backward'