## CLARIN-EHRI Workshop, Prague March 27-28, 2024

### Introduction

**Workshop Title:**
*The Geography of **fear**: Exploring the Emotional
Landscapes in the Holocaust Survivors’ Testimonies*

**Background:** Holocaust survivors' testimonies offer a profound insight into the individual experiences endured during the Nazi genocide. They reveal emotional connections to places, events, and memories, forming what is known as emotional geography. This field helps in understanding the interplay of emotions such as fear, anger, surprise, disgust, and joy across various locations and times. Extracting and analyzing these emotions from vast textual data collections is challenging, but this work aims to explore this possibility using natural language processing techniques.



Let's start by moving into our working directory with the command:

```python
cd clarin_ehri_prague_2024
```

Run the cell below to execute the command (if the line starts with a `#` symbol, delete it first to uncomment the command):

In [4]:
cd /content/drive/MyDrive/clarin_ehri_prague_2024

/content/drive/MyDrive/clarin_ehri_prague_2024


In [35]:
# Install and import libraries
!pip -q install geonamescache # helps us access the geonames list
!python -q -m spacy download en_core_web_trf # use the spacy transformer model
from geonamescache import GeonamesCache as gc
import os, shutil, re
from collections import defaultdict, OrderedDict, Counter
from IPython.display import HTML
import pandas as pd
import nltk
import spacy
# from spacy import displacy
from nltk import sent_tokenize
from transformers import pipeline

nltk.download('punkt')

### Task 1: Read and process testimony file

We use data from the **[Holocaust Survivors' Testimonies](https://vha.usc.edu/home)** dataset. Each file contains the question-answer pairs from an interview given by a Holocaust survivor, together with other annotations: `emotion` (negative and positive words) with sentiment scores, `city`, `camp`, `geonoun`, `expression`, and `other_language`.

For this workshop, we will annotate only 10 randomly selected testimonies:
- `268.txt`, `36999.txt`, `37210.txt`, `37250.txt`, `37409.txt`, `37556.txt`, `37567.txt`, `37585.txt`, `37605.txt`, `37648.txt`.

They are stored in the `data` folder.

Uncomment below to use the `ls` command to list the files in the `data` folder:


In [2]:
# ls data

#### But what does a VHA testimony file look like?

Let's define a function `readfile`, as below, that takes a testimony file name and reads the file into memory as a list of lines, stripping surrounding whitespace and dropping blank lines.

```python
readfile = lambda fname: [line.strip() for line in open(f'data/{fname}').readlines() if line.strip()]
```

In [34]:
readfile = lambda fname: [line.strip() for line in open(f'data/{fname}').readlines() if line.strip()]
# readfile('268.txt')
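
The one-line `readfile` above leaves it to Python to close the file. As a minimal sketch, an equivalent named function can close the file explicitly with a `with` block. The name `read_testimony`, its `folder` parameter, and the UTF-8 encoding are assumptions made for this illustration; they are not part of the workshop code.

```python
from pathlib import Path

def read_testimony(fname, folder="data"):
    """Return the non-blank lines of a testimony file, stripped of whitespace."""
    # The `with` block closes the file as soon as reading is done.
    with open(Path(folder) / fname, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Calling `read_testimony('268.txt')` should then give the same list as `readfile('268.txt')`, provided the file is UTF-8 encoded.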

In [6]:
# @title **Exercise 1:** Can you write the command to read file `37210.txt`?

# Delete me and write your code...

### Task 2: Splitting the interview `questions` and `answers`
We can also separate the interviewer's questions from the survivor's responses and present them in a dataframe. Let's define the functions for splitting the `questions` and `answers` from testimony files.

###### Defining the functions

In [36]:
def get_survivor_initial(filename):
  initials=[] #list for storing all possible initials
  testimony=readfile(filename)
  for line in testimony:
    #search through each line and append any pattern that looks like an initial
    m = re.search(r'\w*:', line)
    if m: initials.append(m.group())
  #from the most common 2 pick the one that is not the interviewer
  return testimony, [initial for initial, _ in Counter(initials).most_common()[:2] if initial!='INT:'][0]

def get_questions_and_answers(filename):
  testimony, initial = get_survivor_initial(filename) # returns the initial of the speaker
  # split the interviews based on the interviewers questions or promptings
  qas = ['INT: '+qa for qa in ' '.join(testimony[2:]).split('INT: ')]
  # return pairs of question/promptings and answers/responses from the survivors
  questions, answers = list(zip(*[(qa.split(initial)[0],initial+qa.split(initial)[1]) for qa in qas if len(qa.split(initial))==2]))
  return pd.DataFrame.from_dict({'fileID':[filename[:-4]]*len(questions), 'questions':questions, 'answers':answers})
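
To make the splitting logic easier to follow, here is a simplified sketch of the same idea on a toy transcript. The transcript lines and the speaker tag `AB:` are invented for illustration. Unlike the functions above, the sketch strips the speaker tags from the resulting pairs and does not build a dataframe.

```python
import re
from collections import Counter

# Toy transcript lines (invented for illustration, not from a real testimony).
lines = [
    "INT: What is your name?",
    "AB: My name is Anna.",
    "INT: Where were you born?",
    "AB: In a small town.",
]

# Step 1: collect everything that looks like a speaker tag, e.g. "INT:" or "AB:".
tags = [m.group() for line in lines if (m := re.search(r"\w+:", line))]

# Step 2: the survivor's tag is the most frequent one that is not the interviewer's.
initial = [tag for tag, _ in Counter(tags).most_common(2) if tag != "INT:"][0]

# Step 3: split the joined text at each interviewer turn, then at the survivor's tag.
turns = [t for t in " ".join(lines).split("INT: ") if t]
pairs = [tuple(part.strip() for part in t.split(initial))
         for t in turns if t.count(initial) == 1]

print(initial)   # AB:
print(pairs[0])  # ('What is your name?', 'My name is Anna.')
```

Turns in which the survivor's tag appears more than once, or not at all, are dropped. The `len(qa.split(initial))==2` check in `get_questions_and_answers` has the same effect.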

##### Processing testimony files
You do not need to study the functions, but if you are curious, click on `Show code` and have a peek.

We'll use one of them, `get_questions_and_answers()`, to transform a testimony into a dataframe with columns for `questions` and `answers`.


In [37]:
get_questions_and_answers('268.txt')

Now use the same function we used above to transform testimony `37250.txt` into a dataframe with questions and answers

In [9]:
# @title **Exercise 2:** Transform file `37250.txt` to a dataframe?

# Delete me and write your code...

### Task 3: Split `answers` into sentences

Some responses are very long and span many sentences, e.g. the response at index 83 has 24135 characters or 4687 tokens. To confirm this, you can run the code below (once `testimony_268_qas` has been created in the next cell):
```python
index_token_size = {i:len(answer.split()) for i, answer in enumerate(testimony_268_qas['answers'])}
max(index_token_size, key=index_token_size.get)
```
followed by
```python
print(f"""characters: {len(testimony_268_qas['answers'][83])}
tokens: {len(testimony_268_qas['answers'][83].split())}""")
```
It will therefore be better to segment the answers further into sentences for better processing.

In [13]:
testimony_268_qas = get_questions_and_answers('268.txt')

In [14]:
fileIds, answerIds, sentences = [],[],[]
for i in range(len(testimony_268_qas)):
  fileID, questions, answers = testimony_268_qas.iloc[i]
  sents = sent_tokenize(testimony_268_qas.answers[i][4:])
  fileIds.extend([fileID]*len(sents))
  answerIds.extend([i]*len(sents))
  sentences.extend(sents)
testimony_268_sents = pd.DataFrame.from_dict({'fileID':fileIds, 'answerID':answerIds, 'sentences':sentences})
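
The loop above can be pictured with a small, dependency-free sketch. The regular-expression splitter `naive_sent_split` is only a rough stand-in for `sent_tokenize` (it is an assumption of this example, not what the workshop uses), and the two answers are invented. The sketch removes the speaker tag by splitting at the first colon, whereas the cell above uses `[4:]`, which assumes a two-letter tag followed by a colon and a space.

```python
import re

def naive_sent_split(text):
    """Rough stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Toy answers (invented for illustration).
answers = ["HR: I was born in 1925. We lived near the border.", "HR: Yes."]

rows = []
for answer_id, answer in enumerate(answers):
    body = answer.split(":", 1)[1]  # drop the speaker tag, whatever its length
    for sent in naive_sent_split(body):
        # one row per sentence, remembering which answer it came from
        rows.append({"answerID": answer_id, "sentence": sent})

for row in rows:
    print(row)
```

Each row keeps the index of the answer it came from, so sentence-level results can later be traced back to the original response.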

### Task 4: Identifying places and other entities in the testimony

##### Let's start by importing the `en_core_web_trf` model and adding the GEONOUN patterns

In [44]:
# import and load the spacy web transformer model
import en_core_web_trf
nlp = en_core_web_trf.load()
nlp.add_pipe('merge_entities')

# Add the `entity_ruler` to the pipeline before the NER module
ruler = nlp.add_pipe("entity_ruler", before='ner')

# add patterns for the GEONOUN label (one pattern per line of `combined_geonouns.txt`)
patterns =  [{"label": "GEONOUN", "pattern": noun} for noun in open('combined_geonouns.txt').read().strip().split('\n')]
ruler.add_patterns(patterns)

Then we use the `PlaceNames` and `Annotator` classes below to define our place names and other entities, as well as to perform annotations. You can view the code to see how it works or just trust me and use it as it is 😀


In [47]:
sortbylen = lambda lst: sorted(set(lst), key=lambda v:len(v), reverse=True)
class PlaceNames:
    def __init__(self):
      self.resources_url= "https://raw.githubusercontent.com/SpaceTimeNarratives/demo/main/resources/"
      self.__download_resources()
      self.additional_cities = ['New York'] #cities not in GeoNames or Aliases
      self.additional_countries = ['America', 'the United States','Czechoslovakia'] #countries not in GeoNames or Aliases
      self.cities, self.city_names = self.__get_cities()
      self.us_states, self.us_state_names = self.__get_us_states()
      self.countries, self.country_names = self.__get_countries()
      self.continents, self.continent_names = self.__get_continents()
      self.camps = self.__get_camps()
      self.geonouns = self.__get_geonouns()
      self.ambiguous_cities = self.__get_ambiguous_cities()
      # self.sentiment_scores  = None
      # self.emotion_scores    = None

    # city details and names
    def __get_cities(self):
      __cities = {i:{'geonameid':detail['geonameid'], 'name':detail['name'].replace("'",'’'),
             'latitude':float(detail['latitude']), 'longitude':float(detail['longitude']),
             'countrycode':detail['countrycode']} for i, (_, detail) in enumerate(gc().get_cities().items())}
      __names = [city['name'] for _, city in __cities.items()]
      __names.extend(self.additional_cities)
      return __cities, sortbylen(__names)

    # US states details and names
    def __get_us_states(self):
      __us_states = {i:{'geonameid':detail['geonameid'],'name':detail['name'].replace("'",'’'),'code':detail['code']}
              for i, (_, detail) in enumerate(gc().get_us_states().items())}
      __names = sortbylen([us_state['name'] for _, us_state in __us_states.items()])
      return __us_states, __names

    # country details and names
    def __get_countries(self):
      __countries = {i:{'geonameid':detail['geonameid'], 'iso': detail['iso'], 'name':detail['name'].replace("'",'’'),
                'capital':detail['capital'].replace("'",'’'), 'continentcode':detail['continentcode'], 'neighbours':detail['neighbours']}
                for i, (_, detail) in enumerate(gc().get_countries().items())}
      __names = [country['name'] for _, country in __countries.items()]
      __names.extend(self.additional_countries)
      return __countries, sortbylen(__names)

    # continent details and names
    def __get_continents(self):
      __continents = {i:{'geonameid':detail['geonameId'], 'name':detail['name'].replace("'",'’'), 'continentcode':detail['continentCode'],
                 'bbox_north':detail['bbox']['north'], 'bbox_south':detail['bbox']['south'], 'bbox_east':detail['bbox']['east'],
                 'bbox_west':detail['bbox']['west']}  for i, (_, detail) in enumerate(gc().get_continents().items())}
      __names = sortbylen([continent['name'] for _, continent in __continents.items()])
      return __continents, __names

    # ---------Other resources------------
    # Download resource file()
    def __download_resources(self):
      for res in ['cleaned_holocaust_camps.txt','combined_geonouns.txt','ambiguous_cities.txt']:
        if not os.path.exists(res):
          os.system(f"wget -q {self.resources_url}{res}")
          print(f"{res} successfully downloaded.")

    def __read_source_file(self, source_file):
      return open(source_file).read().strip().split('\n')

  # Concentration camps
    def __get_camps(self, srcfile=None):
      source_file = srcfile if srcfile else 'cleaned_holocaust_camps.txt'
      __camps = self.__read_source_file(source_file)
      if __camps: return sortbylen([name for name in __camps if name not in [country['name']
                                                for _, country in self.countries.items()]])
      else:
        print(f"Error: Reading file '{source_file}'.")
        return None

  # Geographical feature names
    def __get_geonouns(self, srcfile=None):
      source_file = srcfile if srcfile else 'combined_geonouns.txt'
      __geonouns = self.__read_source_file(source_file)
      if __geonouns: return sortbylen(__geonouns)
      else:
        print(f"Error: Reading file '{source_file}'.")
        return None

  # Get ambiguous cities
    def __get_ambiguous_cities(self, srcfile=None):
      source_file = srcfile if srcfile else 'ambiguous_cities.txt'
      __ambiguous_cities = self.__read_source_file(source_file)
      if __ambiguous_cities: return sortbylen(__ambiguous_cities)
      else:
        print(f"Error: Reading file '{source_file}'.")
        return None

  # Check Ambiguous Cities
    isCityAmbiguous = lambda self, city: (True, [_city for _, _city in self.cities.items() if _city['name'].lower() == city.lower()]
                                          ) if city in self.ambiguous_cities else (False,f"{city} is not ambiguous: {[_city for _, _city in self.cities.items() if _city['name'].lower() == city.lower()][0]}")
class Annotator(PlaceNames):
    def __init__(self, **kwargs): #kwargs = ['text', 'emotion_model', 'sentiment_model']
      super().__init__()
      # self.file           = kwargs['file'] if 'file' in kwargs else None ##FIX LATER
      self.text             = kwargs.get('text')
      self.emotion_model    = kwargs.get('emotion_model')   # None if not supplied
      self.sentiment_model  = kwargs.get('sentiment_model') # None if not supplied
      self.entity_tags      = ['CONTINENT', 'COUNTRY', 'US-STATE', 'CITY', 'CAMP',
                               'DATE','TIME','GEONOUN']
      self.entities = self.__get_entities(self.text) if self.text else None
      self.output_dir ='output'
      self.__BG_COLOR={'CITY':'#feca74','COUNTRY':'#f0b6de','CONTINENT':'#e4e7d2','US-STATE':'#feca74',
                       'CAMP':'#b3d6f2','GEONOUN': '#9cc9cc','DATE':'#c7f5a9', 'TIME':'#a9f5bc',
                       'PLACE':'#e4e7d2', 'EVENT':'#e0aedd'}
    # merging two entities
    def __merge_entities(self, first_ents, second_ents):
      return dict(OrderedDict(sorted({**second_ents, **first_ents}.items())))

    # joining adjacent entities (one space apart) that carry the given tag
    def __join_near_similar_ents(self, ent_dict, tag):
      return {i:(ent[0]+' '+ent_dict[i+len(ent[0])+1][0], tag)
              for i, ent in ent_dict.items() if ent[1]==tag and i+len(ent[0])+1 in ent_dict}

    # extract entities from text
    def __get_entities(self, text=None):
      if text: self.text = text
      if self.text: doc = nlp(self.text)
      else: return f"Error: 'Annotator' has no text to process!"

      __ent_details = {token.idx:(self.text[token.idx:token.idx+len(token)],
         token.ent_type_, token.pos_) for token in doc if token.ent_type_ in
          ['FAC','GPE','LOC','DATE','TIME','EVENT','GEONOUN']}

      # enforce only 'GEONOUNS' pos-tagged as 'NOUN'
      __ent_details= {i:detail for i, detail in __ent_details.items() if detail[1]!='GEONOUN' or detail[2]=='NOUN'}

      #join near similar ents e.g. "concentration:GEONOUN", "camp:GEONOUN" --> "concentration camp:GEONOUN"
      __ent_details= self.__merge_entities(self.__join_near_similar_ents(__ent_details, 'GEONOUN'), __ent_details)

      return {i:self.__convert_place_entities(detail[:2]) for i, detail in __ent_details.items()}

    def __convert_place_entities(self, place):
      name, tag = place
      if tag in ['FAC','GPE','LOC']:
        if name in self.continent_names: return name, 'CONTINENT'
        elif name in self.country_names: return name, 'COUNTRY'
        elif name in self.us_state_names: return name, 'US-STATE'
        elif name in self.city_names: return name, 'CITY'
        elif name in self.camps: return name, 'CAMP'
        else: return name, 'PLACE'
      return name, tag

    def __get_tagged_list(self, text, __ent_details):
      entities = {i:self.__convert_place_entities(detail[:2]) for i, detail in __ent_details.items()}
      begin, tokens_tags = 0, []
      for start, (ent, tag) in entities.items():
        if begin <= start:
          tokens_tags.append((text[begin:start], None))
          tokens_tags.append((text[start:start+len(ent)], tag))
          begin = start+len(ent)
      tokens_tags.append((text[begin:], None)) #add the last untagged chunk
      return tokens_tags

    def __mark_up(self, token, tag=None):
      if tag:
        begin_bkgr = f'<bgr class="entity" style="background: {self.__BG_COLOR[tag]}; padding: 0.1em 0.1em; margin: 0 0.15em; border-radius: 0.23em;">'
        end_bkgr = '\n</bgr>'
        begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
        end_span = '\n</span>'
        return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
      return f"{token}"

    def visualize(self):
      token_tag_list = self.__get_tagged_list(self.text, self.entities)
      start_div = f'<div class="entities" style="line-height: 2.0; direction: ltr">'
      end_div = '\n</div>'
      html = start_div
      for token, tag in token_tag_list:
        html += self.__mark_up(token,tag)
      html += end_div
      return HTML(html)
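
The step that joins neighbouring `GEONOUN` tokens may be easier to see in isolation. The following is a standalone sketch of that idea, not the class method itself: `join_adjacent` is a hypothetical helper, and unlike `__join_near_similar_ents` it also checks that the second entity carries the same tag. Entities are stored as `{character_offset: (text, tag)}`, as in the class above.

```python
def join_adjacent(ents, tag):
    """Merge an entity with the one starting one space after it when both carry `tag`."""
    joined = {}
    for start, (text, label) in ents.items():
        next_start = start + len(text) + 1  # +1 for the separating space
        if label == tag and next_start in ents and ents[next_start][1] == tag:
            joined[start] = (text + " " + ents[next_start][0], tag)
    return joined

# Invented example sentence; offsets are computed rather than hard-coded.
sentence = "They took us to a concentration camp."
ents = {
    sentence.index("concentration"): ("concentration", "GEONOUN"),
    sentence.index("camp"): ("camp", "GEONOUN"),
}

joined = join_adjacent(ents, "GEONOUN")
merged = dict(sorted({**ents, **joined}.items()))  # joined entries override the originals
print(joined)  # {18: ('concentration camp', 'GEONOUN')}
```

After merging, the shorter entry for `camp` is still present at its own offset. In the class above, `__get_tagged_list` skips it because it starts before the end of the longer entity.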

In [None]:
text = testimony_268_qas.answers[7]
text

In [64]:
annotator = Annotator(text=text)

In [None]:
#@title We can list the entities we have extracted...
annotator.entities

In [None]:
#@title ...or even visualize them.
annotator.visualize()
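
Before any HTML is produced, the visualization first cuts the text into tagged and untagged chunks. Here is a minimal standalone sketch of that chunking, assuming entities are given as `{character_offset: (text, tag)}`. The function name `tag_chunks` is hypothetical, and the tags in the example are assigned by hand for illustration rather than produced by the model.

```python
def tag_chunks(text, entities):
    """Split `text` into (chunk, tag) pairs; untagged stretches get tag None."""
    chunks, cursor = [], 0
    for start in sorted(entities):
        name, tag = entities[start]
        if start < cursor:  # overlaps an entity already emitted: skip it
            continue
        chunks.append((text[cursor:start], None))
        chunks.append((text[start:start + len(name)], tag))
        cursor = start + len(name)
    chunks.append((text[cursor:], None))  # trailing untagged text
    return chunks

text = "I was born in Czeladz, in Poland."
entities = {
    text.index("Czeladz"): ("Czeladz", "CITY"),    # tag assumed for illustration
    text.index("Poland"): ("Poland", "COUNTRY"),   # tag assumed for illustration
}

for chunk, tag in tag_chunks(text, entities):
    print(repr(chunk), tag)
```

Each tagged chunk can then be wrapped in coloured markup, while the untagged chunks are emitted unchanged.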

### Task 5: Emotion Classification

For emotions, we will use this transformer model: [j-hartmann/emotion-english-distilroberta-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base)

In [68]:
# testimony_268_sents

In [80]:
from transformers import pipeline
# load pre-trained emotion classification model
model_path = "j-hartmann/emotion-english-distilroberta-base"
model = pipeline("text-classification", model=model_path, tokenizer=model_path,
                        max_length=512, truncation=True)

In [94]:
for i in range(20):
  testimony = testimony_268_sents.sentences[i]
  score = model(testimony)
  print(f"{testimony}\n- {score[0]['label']}, {score[0]['score']}")
  print()

My name is Henry Rosmarin.
- neutral, 0.9009859561920166

Yes.
- neutral, 0.8983097672462463

R-O-S-M-A-R-I-N.
- neutral, 0.7591530680656433

At birth it was Henryk, H-E-N-R-Y-K Rozmaryn, R-O-S-M-A-R-Y-- I'm sorry, English spelling.
- sadness, 0.7981343269348145

Polish spelling, R-O-Z-M-A-R-Y-N, Rozmaryn.
- neutral, 0.8772909641265869

October 7th, 1925.
- neutral, 0.3719763457775116

My present age is 73.
- sadness, 0.6599793434143066

I was born in a little town called Czeladz, in Poland.
- joy, 0.9255752563476562

It's southwestern corner of Poland.
- neutral, 0.9070968627929688

I spell it for you, C-- capital C-Z-E-L-A-D-Z.
- neutral, 0.935821533203125

The nearest town was Bedzin, Sosnowiec.
- neutral, 0.7137645483016968

And towards the German border was Katowice and Siemianowice.
- neutral, 0.5386792421340942

Incidentally, it was in Siemianowice that I lived.
- neutral, 0.7330958843231201

Actually, I was an infant.
- fear, 0.4007508158683777

My mom told me she went to Czela
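
Some of the predictions above have low scores (for example `neutral` at about 0.37 for a bare date), so it can help to keep only confident, non-neutral predictions before drawing conclusions. This is a minimal sketch of such a filter, not part of the workshop pipeline: the triples are copied from the output above with scores rounded, while the function `confident` and the cut-off of 0.5 are assumptions to adjust for your own analysis.

```python
# (sentence, label, score) triples copied from the output above, scores rounded.
predictions = [
    ("My name is Henry Rosmarin.", "neutral", 0.901),
    ("October 7th, 1925.", "neutral", 0.372),
    ("I was born in a little town called Czeladz, in Poland.", "joy", 0.926),
    ("Actually, I was an infant.", "fear", 0.401),
]

THRESHOLD = 0.5  # hypothetical cut-off, not a recommended value

def confident(preds, threshold=THRESHOLD):
    """Keep predictions that are not 'neutral' and whose score reaches the threshold."""
    return [(sent, label, score) for sent, label, score in preds
            if label != "neutral" and score >= threshold]

for sent, label, score in confident(predictions):
    print(f"{label} ({score}): {sent}")
```

With these four sentences, only the `joy` prediction passes the filter. The low-scoring `fear` label for "Actually, I was an infant." is dropped.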

In [None]:
# @title **Emotion 10**: Define `filenames` from the working directory
# emotion_10_zip_file = '/content/drive/MyDrive/UCREL/demo/resources/ht_resources/data/emotion_scores_10.zip'
emotion_10_zip_file = 'emotion_scores_10.zip' # wget saves the archive under this name in the working directory
!wget -c 'https://github.com/SpaceTimeNarratives/demo/raw/main/resources/emotion_scores_10.zip'
shutil.unpack_archive(emotion_10_zip_file)
filenames = ['268', '36999', '37210', '37250', '37409', '37556', '37567', '37585', '37605', '37648']
file_paths = [f'emotion_scores/{f}_emotion_scores.xlsx' for f in filenames if os.path.exists(f'emotion_scores/{f}_emotion_scores.xlsx')]
os.listdir()

In [98]:
!wget -c 'https://github.com/SpaceTimeNarratives/demo/raw/main/resources/emotion_scores_10.zip'

--2024-03-28 11:59:24--  https://github.com/SpaceTimeNarratives/demo/raw/main/resources/emotion_scores_10.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/SpaceTimeNarratives/demo/main/resources/emotion_scores_10.zip [following]
--2024-03-28 11:59:24--  https://raw.githubusercontent.com/SpaceTimeNarratives/demo/main/resources/emotion_scores_10.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

