## Character Relationships Workflow

#### 1. Split into chapters

First, the book is split into chapters, because analysing the entire book at once requires too much memory.

In our case, chapters are delimited by 6 consecutive newline characters (`\n`), and the first line of each block of text is taken as the chapter title. The chapter splitting may not be perfect, and some manual adjustments may be necessary.
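The splitting rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual `split_chapters` implementation: the function name `split_into_chapters` and the sample text are hypothetical.

```python
def split_into_chapters(text: str, delimiter: str = "\n" * 6) -> list[tuple[str, str]]:
    """Split raw book text into (title, body) pairs.

    Hypothetical helper: blocks are separated by six consecutive newlines,
    and the first line of each block is taken as the chapter title.
    """
    chapters = []
    for block in text.split(delimiter):
        block = block.strip("\n")
        if not block.strip():
            continue  # skip empty blocks left over from extra newlines
        title, _, body = block.partition("\n")
        chapters.append((title.strip(), body.strip()))
    return chapters


# Made-up sample text with two chapters
sample = "Chapter One\nFirst chapter text." + "\n" * 6 + "Chapter Two\nSecond chapter text."

for title, body in split_into_chapters(sample):
    print(title, "->", len(body), "characters")
```

A chapter whose text itself contains six consecutive newlines would be split in two under this rule, which is one reason manual adjustments may be needed.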

The `book` variable is the name (without file extension) of the `.txt` text file of the book. In this case, *Worm*'s text is stored in `worm.txt` in the `text` directory.

In [None]:
import os

from analysis import split_chapters

text_dir = 'text'
book = 'worm'

split_chapters(text_dir, book)

#### 2. Conduct Named Entity Recognition and Coreference Resolution using BookNLP 

The BookNLP library was used to conduct Named Entity Recognition and Coreference Resolution (matching characters and tracking when they are present even if pronouns are used).

To install BookNLP, run the following on the command line:
```bash
pip install booknlp
python -m spacy download en_core_web_sm
```

In [None]:
from analysis import run_ner_coref

ner_coref_data_dir = os.path.join('booknlp_output', book)
text_dir = os.path.join('text', book)

run_ner_coref(text_dir, ner_coref_data_dir)

#### 3. Searching and Matching Main Characters

BookNLP provides a list of characters in each of the chapters. The next task is to match them across the chapters for the entire text.

Firstly, BookNLP tries to refer to each identified character by a proper noun where possible, falling back to a common noun and then a pronoun. These nouns were matched across the chapters.
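The naming preference described above (proper noun, then common noun, then pronoun) can be sketched as below. This is a hedged illustration, not the code inside `get_main_char`. It assumes the tab-separated layout of BookNLP's `.entities` output (columns `COREF`, `start_token`, `end_token`, `prop`, `cat`, `text`, with `prop` being `PROP`, `NOM` or `PRON`); the sample rows and the `pick_names` helper are made up.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up sample rows in the assumed .entities layout (tab-separated)
SAMPLE_ENTITIES = (
    "COREF\tstart_token\tend_token\tprop\tcat\ttext\n"
    "0\t0\t0\tPRON\tPER\tI\n"
    "1\t5\t5\tPROP\tPER\tAlice\n"
    "1\t9\t9\tPRON\tPER\tshe\n"
    "1\t14\t15\tNOM\tPER\tthe girl\n"
    "1\t20\t20\tPROP\tPER\tAlice\n"
    "2\t30\t30\tNOM\tPER\tteacher\n"
)

# Proper noun first, then common noun, then pronoun
PREFERENCE = ("PROP", "NOM", "PRON")


def pick_names(entities_tsv: str) -> dict[int, str]:
    """Choose one display name per coreference id, preferring proper nouns."""
    mentions = defaultdict(lambda: defaultdict(Counter))
    reader = csv.DictReader(
        io.StringIO(entities_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if row["cat"] != "PER":
            continue  # only people are of interest here
        mentions[int(row["COREF"])][row["prop"]][row["text"]] += 1

    names = {}
    for coref_id, by_type in mentions.items():
        for mention_type in PREFERENCE:
            if by_type[mention_type]:
                # Most frequent mention text of the best available type
                names[coref_id] = by_type[mention_type].most_common(1)[0][0]
                break
    return names


print(pick_names(SAMPLE_ENTITIES))
```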

In [None]:
from analysis import get_main_char

characters_data_dir = os.path.join('characters', book)

get_main_char(ner_coref_data_dir, characters_data_dir)

Now, a bit of manual work is needed. Some characters have aliases or titles, and sometimes they are referred to with or without their surnames. The more common nouns were tracked and matched based on knowledge of the text (having read the book helps).

The most common characters can be viewed from the `main_characters.json` file in the `characters_data_dir` directory. Adjust `TOP_N_CHAR` accordingly.

In [None]:
import os
import json


TOP_N_CHAR = 200


with open(os.path.join(characters_data_dir, "main_characters.json"), 'r') as file:
    main_characters_data = json.load(file)

main_characters_data = dict(sorted(main_characters_data.items(), key = lambda x: x[1]["count"], reverse=True))

count = 0

for name, det in main_characters_data.items():
    if count == TOP_N_CHAR: 
        break

    print(name)
    count += 1

The `main_character_aliases.json` file will have to be created and stored in the `characters_data_dir`. An example of the file is as follows:

```json
[
    ["Name of Person 0", "Alias 1 of Person 0", "Alias 2 of Person 0"],
    ["Name of Person 1", "Alias 1 of Person 1"],
    ["Name of Person 2"]
]
```

Each character's list can contain any number of names and aliases, so long as there is at least one. Each string in the list must exactly match a noun in the `main_characters.json` list.

Now, the main characters can be matched across the chapters.

In [None]:
from analysis import consolidate_main_char

consolidate_main_char(characters_data_dir)

Note: the `consolidated_id` in the newly generated `main_characters_consolidated.json` file will be the index of the character in the list manually created in `main_character_aliases.json`. That character's names and aliases are in the corresponding index.
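To make the index relationship concrete, here is a minimal sketch of how the alias list maps names to a `consolidated_id`. The `build_alias_index` helper is hypothetical and is not part of the `analysis` module; rejecting a name that appears in two groups is also an assumption made for this sketch.

```python
# Same structure as the example main_character_aliases.json above
aliases = [
    ["Name of Person 0", "Alias 1 of Person 0", "Alias 2 of Person 0"],
    ["Name of Person 1", "Alias 1 of Person 1"],
    ["Name of Person 2"],
]


def build_alias_index(alias_groups: list[list[str]]) -> dict[str, int]:
    """Map every name or alias to the index of its group (the consolidated id)."""
    index = {}
    for consolidated_id, names in enumerate(alias_groups):
        for name in names:
            if name in index:
                raise ValueError(f"{name!r} appears in more than one alias group")
            index[name] = consolidated_id
    return index


alias_index = build_alias_index(aliases)

consolidated_id = alias_index["Alias 1 of Person 1"]
print(consolidated_id)           # position of Person 1's group in the list
print(aliases[consolidated_id])  # every name and alias of that character
```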

#### 4. Linking Sentences to Entities

The next step is to associate all the sentences in the book with the characters mentioned in them. Each character is referenced by their `consolidated_id`.

Some characters might be wrongly consolidated or recognised because they share similar names. To work around this, a custom filter function can be written. For every occurrence of a character in each chapter, the `novel_id` of the character (as given in `main_characters.json`), the chapter name, and any other arguments are passed into the filter function. If the function returns `True`, the `consolidated_id` is fetched. If not, the character occurrence is ignored.

The format of the filter function is as follows:
```py
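from typing import Any


def filter_function(novel_id: int, chapter_name: str, filter_args: Any) -> bool:
    """CUSTOM filter function - write your own logic here"""
    character_is_filtered_out = False  # Replace with your own condition

    # Return False to ignore this occurrence, True to keep it
    return not character_is_filtered_out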
```

Below is an example of the filter function used for *Worm*. In *Worm*, the narrator (with a `novel_id` of 0) does not appear in any interludes, the epilogues (Teneral), or the Migration and Sentinel arcs, as those are not narrated in the first person. Therefore, any `novel_id` of 0 in those chapters is definitely incorrect. However, the narrator's other names, such as 'Taylor' (113) or 'Skitter' (6), could appear there and would still refer to the character. Therefore, only the `novel_id` of 0 is excluded.

In [None]:
from typing import Any


def filter_func(novel_id: int, chapter: str, _: Any) -> bool:
    # The narrator should not appear in these chapters, so the filter returns `False` for them
    excluded_chapters = ("Interlude", "Teneral", "Migration", "Sentinel")
    return not (novel_id == 0 and any(keyword in chapter for keyword in excluded_chapters))

A `relevant_sentences.csv` file is generated under the directory of each chapter in the output directory of BookNLP (`ner_coref_data_dir`).

In [None]:
from analysis import get_relevant_sentences_in_book

get_relevant_sentences_in_book(
    ner_coref_data_dir,
    text_dir,
    characters_data_dir,
    filter_func,
)

The above `filter_func` did not use any additional arguments. To inject additional information (such as a list of chapters or a dictionary) into the filtering logic, pass it as an extra argument to the `get_relevant_sentences_in_book` function.

```py
get_relevant_sentences_in_book(
    ner_coref_data_dir,
    text_dir,
    characters_data_dir,
    filter_func,
    filter_args  # Can be any type
)
```
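As an illustration of a filter that does use its extra argument, the sketch below moves the excluded chapters out of the function body and into `filter_args`. It assumes the three-argument signature from the template above; the function name, the dictionary layout, and the chapter names in the checks are hypothetical.

```python
from typing import Any


def filter_by_excluded_chapters(novel_id: int, chapter_name: str, filter_args: Any) -> bool:
    """Return False when the character should be ignored in this chapter.

    `filter_args` is assumed here to be a dict mapping a novel_id to the
    chapter-name keywords in which that id must not be counted.
    """
    excluded_keywords = filter_args.get(novel_id, ())
    return not any(keyword in chapter_name for keyword in excluded_keywords)


# Hypothetical configuration: novel_id 0 is ignored in these kinds of chapters
excluded_chapters = {0: ("Interlude", "Teneral", "Migration", "Sentinel")}

print(filter_by_excluded_chapters(0, "Interlude 1", excluded_chapters))   # ignored
print(filter_by_excluded_chapters(6, "Interlude 1", excluded_chapters))   # kept
print(filter_by_excluded_chapters(0, "Some Other Chapter", excluded_chapters))  # kept
```

Keeping the exclusions in data rather than in code means the same function can be reused for another book by passing a different dictionary.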

#### 5. Sentiment Analysis

The final step is to conduct sentiment analysis on the corpus. Each sentence is analysed with the AFINN sentiment analysis library, which can be installed via `pip`.
```bash
pip install afinn
```

The sentences are iterated through. If 2 or more characters are mentioned in the sentence, the sentiment score from the `Afinn.score` method will be added to the interaction array between each pair of characters.

With `n` main characters, an `n * n * 2` interaction array is created, with `arr[c1][c2][0]` denoting the total sentiment score between the characters with consolidated ids `c1` and `c2`, and `arr[c1][c2][1]` denoting the number of times the characters interacted.
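The accumulation into the interaction array can be sketched as below. This is a simplified illustration rather than the `analyse_sentiments` implementation: the sentiment scores are supplied as plain numbers (in the workflow they would come from `Afinn.score`), the `build_interactions` helper and sample data are made up, and updating both `arr[c1][c2]` and `arr[c2][c1]` is an assumption.

```python
from itertools import permutations


def build_interactions(n: int, scored_sentences: list[tuple[float, list[int]]]) -> list:
    """Accumulate sentence scores into an n * n * 2 interaction array.

    Each item of `scored_sentences` is (sentiment score, consolidated ids
    of the characters mentioned in that sentence).
    """
    arr = [[[0.0, 0] for _ in range(n)] for _ in range(n)]
    for score, character_ids in scored_sentences:
        present = sorted(set(character_ids))
        if len(present) < 2:
            continue  # a sentence needs 2 or more characters to count
        for c1, c2 in permutations(present, 2):
            arr[c1][c2][0] += score  # total sentiment score
            arr[c1][c2][1] += 1      # number of interactions
    return arr


# Made-up scored sentences for three characters
sample_sentences = [
    (3.0, [0, 1]),
    (-2.0, [0, 1, 2]),
    (1.0, [2]),  # only one character, so it is skipped
]

interactions = build_interactions(3, sample_sentences)
print(interactions[0][1])
print(interactions[1][2])
```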

In the `collate_relations` function, the parameter `deduct_opposing_avg` is set to `True`. This deducts from the sentiment score in `arr[c1][c2][0]` the average sentiment of the interactions of character `c2`. This normalisation appears to better reflect the actual relationships between characters, but it is not definitive and can be turned off at the user's discretion.
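One possible reading of this normalisation is sketched below; the actual `collate_relations` logic may differ. The assumption made here is that the average is `c2`'s mean sentiment per interaction across all of its pairings, deducted once for every interaction between `c1` and `c2` so that the units match. The `deduct_opposing_average` helper and the sample array are hypothetical.

```python
def deduct_opposing_average(arr: list) -> list:
    """Return a copy of the interaction array with c2's mean sentiment deducted.

    Assumed reading: adjusted[c1][c2][0] =
        arr[c1][c2][0] - mean_sentiment(c2) * arr[c1][c2][1]
    """
    n = len(arr)
    adjusted = [[list(cell) for cell in row] for row in arr]
    for c2 in range(n):
        total = sum(arr[c2][k][0] for k in range(n))
        count = sum(arr[c2][k][1] for k in range(n))
        if count == 0:
            continue  # c2 never interacted with anyone
        mean = total / count
        for c1 in range(n):
            adjusted[c1][c2][0] = arr[c1][c2][0] - mean * arr[c1][c2][1]
    return adjusted


# Made-up array: character 1 interacts positively with 0 and negatively with 2
sample_arr = [[[0.0, 0] for _ in range(3)] for _ in range(3)]
sample_arr[0][1] = [4.0, 2]
sample_arr[1][0] = [4.0, 2]
sample_arr[1][2] = [-2.0, 2]
sample_arr[2][1] = [-2.0, 2]

adjusted_arr = deduct_opposing_average(sample_arr)
print(adjusted_arr[0][1], adjusted_arr[2][1])
```

Under this reading, a character who is mentioned in mostly negative sentences does not automatically appear to have poor relationships with everyone, since their own baseline is removed first.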

In [None]:
from analysis import analyse_sentiments, collate_relations

analyse_sentiments(ner_coref_data_dir, characters_data_dir)
collate_relations(characters_data_dir, deduct_opposing_avg = True)

The file `interactions.json` will be created in the provided `characters_data_dir`, containing the above interaction array. The file linking consolidated ids to character names and aliases is `main_character_aliases.json` in the same directory.