<h1> Pre-processing 2 </h1>

This step involves:
- Removing reports not written in English
- Lowercasing and removing punctuation
- Replacing named entities (for anonymity) and dates with placeholders
- Replacing certain 2-3 word time phrases with placeholders
- Tokenization

In [1]:
#imports
import pandas as pd
import pickle
import re
import nltk
import nltk.data
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from collections import Counter
import spacy
from langdetect import detect

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aksel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
#import data after pre-processing 1
df = pd.read_excel("processed_data_1.xlsx")

<h3> Remove reports not written in English </h3>

In [None]:
#flag english reports; detect() raises on text without detectable features
def is_english(text):
    try:
        return detect(text) == "en"
    except Exception:
        return False

#keep only english rows and reset index
df = df[df.text.apply(is_english)].reset_index(drop=True)

<h3> Lowercasing & Punctuation </h3>

Lowercasing makes it less likely that words at the beginning of a sentence (e.g. "Reconstituting and ...") are identified as named entities.

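Remove punctuation, except periods and commas. Because only purely alphabetic tokens are kept, numbers and contraction fragments containing an apostrophe (e.g. "n't") are removed as well.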
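
As a minimal sketch of what this filter keeps and drops, the snippet below applies the same rule to a hand-written token list. The list stands in for the tokenizer's output and the helper name `filter_tokens` is hypothetical; both are assumptions made for illustration.

```python
def filter_tokens(tokens):
    """Lowercase alphabetic tokens, keep '.' and ',', drop everything else."""
    return [tok.lower() for tok in tokens if tok.isalpha() or tok in {".", ","}]

# Hand-written tokens (an assumption), shaped like a word tokenizer's output.
tokens = ["Reconstituting", "and", "dosing", "took", "2", "hours", ",",
          "I", "did", "n't", "mind", "!"]

cleaned = " ".join(filter_tokens(tokens))
print(cleaned)  # reconstituting and dosing took hours , i did mind
```

Note that dropping "n't" turns "did n't mind" into "did mind", so negations expressed through contractions are lost at this step.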

In [5]:
def lowercasing_punctuation(string):
    tokens = nltk.word_tokenize(string)  # tokenize string to words
    # lowercase alphabetic tokens; keep "." and ","; drop all other tokens
    tokens = [ token.lower() for token in tokens
               if token.isalpha()
               or token in [".", ","]
             ]
    return " ".join(tokens) # convert back to string

df.text = df.text.apply(lowercasing_punctuation)

<h3> Anonymize named entities </h3>

For anonymity, replace every token that belongs to one of the following entity types with the name of that type.

- PERSON (Names of people)
- ORG (Organisations) 
- GPE (Geopolitical entity)
- LOC (Location)
- DATE (Dates and periods)

DATE is included because expressions such as '5th of February' or 'last week' often occur in sentences that contain a time seed word but tend not to concern the perception of time duration. To make these sentences easier to identify and exclude from the BERTopic analyses, the expressions are replaced with the placeholder 'DATE'.
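
The exclusion described above could look roughly like the following sketch. The function name `sentences_to_keep`, the seed word default, and the rule of splitting sentences on stand-alone "." tokens are assumptions for illustration, not part of the original pipeline.

```python
def sentences_to_keep(text, seed_word="time", placeholder="DATE"):
    """Return sentences that contain the seed word but no date placeholder."""
    # Assumes sentences are separated by stand-alone "." tokens,
    # as produced by joining the filtered tokens with spaces.
    sentences = [s.split() for s in text.split(" . ")]
    return [" ".join(tokens) for tokens in sentences
            if seed_word in tokens and placeholder not in tokens]

sample = "i lost track of time . DATE i had a hard time sleeping . the room felt warm"
kept = sentences_to_keep(sample)
print(kept)  # ['i lost track of time']
```

Because the text is lowercased before entity replacement, the uppercase placeholder cannot collide with an ordinary word in the report.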



In [None]:
remove_list=[]

nlp = spacy.load('en_core_web_lg')

for i, text in enumerate(df.text):
    doc = nlp(text)
    # Replace named entities with their entity type
    new_tokens = []
    for token in doc:
        if token.ent_type_ in ["PERSON", "ORG", "GPE", "LOC", "DATE"] and token.text != "time":
            remove_list.append(token.text)
            new_tokens.append(token.ent_type_)
        else:
            new_tokens.append(token.text)

    # Join the new tokens into a single string
    df.loc[i, "text"] = " ".join(new_tokens)

<h3> Remove specific 2-3 word phrases containing time words </h3>

Replace phrases such as 'first time', 'point in time', or 'long story' with 'PLACEHOLDER', as these phrases describe points in time or are idioms unrelated to time perception.

<br>

**Note**: Whether to include sentiment-based time idioms ('best time', 'wonderful time') is debatable.
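
The three-word replacement can be sketched with a single compiled pattern instead of one substitution per prefix. The prefix subset, the sample sentence, and the helper name `mask_phrases` are assumptions chosen for illustration.

```python
import re

# A small subset of prefixes, for illustration only.
prefixes = ["point in", "at the", "waste of"]

# One alternation over all prefixes; re.escape guards against regex metacharacters.
pattern = re.compile(
    r"\b(?:{})\s+time\b".format("|".join(re.escape(p) for p in prefixes))
)

def mask_phrases(text):
    """Replace each '<prefix> time' phrase with three placeholder tokens."""
    return pattern.sub("PLACEHOLDER PLACEHOLDER PLACEHOLDER", text)

masked = mask_phrases("at the time i felt that time slowed down , a waste of time")
print(masked)
# PLACEHOLDER PLACEHOLDER PLACEHOLDER i felt that time slowed down , a PLACEHOLDER PLACEHOLDER PLACEHOLDER
```

The word boundary `\b` keeps 'that time' intact even though 'that' ends in 'at'. Replacing each phrase with three placeholders keeps the token count unchanged.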

In [27]:
#replace 3-word phrases (e.g. "all in time") with placeholders
two_word_prefix = ['all in', 'all the', 'around about', 'around such', 'around the', 'at a', 'at about', 'at any', 'at such', 'at the', 'by the', 'for a', 'for any', 'for such', 'for the', 'in a long', 'it was', 'one more', 'pass the', 'point in', 'the only', 'time to', 'until any', 'until the', 'was a', 'was about', 'was in', 'was the', 'waste of', 'when in']

def three_word_phrases(string):
    for expr in two_word_prefix: 
        regex = r"\b({})\s+time\b".format(expr)
        string = re.sub(regex, r"PLACEHOLDER PLACEHOLDER PLACEHOLDER", string)
    return string
    
df.text = df.text.loc[:].apply(three_word_phrases)


    

#for 2-word phrases with various seed words, it's easier to tokenize the data first
def tokenize_words(string):
    data = nltk.word_tokenize(string)  # tokenize string to words
    return data

df.text = df.text.loc[:].apply(tokenize_words)


#remove 2-word phrases 

#e.g. "first time"
preword_dict = {"time":  ["first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "ninth", "tenth", "whole", "this", "every", "each", "that", "next", "one", "good", 'bad', "same", "last", "hard", "great", "entire", "some", "current", "single", "my", "winter", "summer", "spring", "dinner", "wonderful", "night", 'right', 'have', 'had', "best", 'awful', 'worst', "free", "quality", 'another', 'popular'],
                "short": ["in", "stopped", 'running'],
                'long':  ["not"],
                'rate':  ["to", "i", "would", "any"],  # "i" is lowercase because the text was lowercased earlier
                'length':["in"],
                }
#e.g. "short story"
postword_dict = {
                "time" : [],
                "short": ["story", "sentences", 'phrase', 'version', 'versions'],
                'long':  ["story", "pants", "after", "gone", "before", 'walk', 'walks', 'hair'],
                'rate':  [],
                'length':  ["in"]
                }

#replace with "PLACEHOLDER"
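for key in preword_dict:
    for text in df.text:
        # iterate over every token; the bounds checks below replace the text[:-3] slice
        for i, word in enumerate(text):
            if word != key:
                continue
            # i > 0 prevents text[i-1] from wrapping around to the last token
            if i > 0 and text[i-1] in preword_dict[key]:
                text[i] = "PLACEHOLDER"
                text[i-1] = "PLACEHOLDER"
            # i+1 < len(text) prevents an IndexError on the final token
            elif i + 1 < len(text) and text[i+1] in postword_dict[key]:
                text[i] = "PLACEHOLDER"
                text[i+1] = "PLACEHOLDER"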
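
To check the two-word replacement on a single token list, the same idea can be wrapped in a small function. This is a sketch: the name `mask_two_word_phrases` and the reduced dictionaries are assumptions. Unlike the loop above, it reads from the unmodified input and returns a copy.

```python
PLACEHOLDER = "PLACEHOLDER"

def mask_two_word_phrases(tokens, prewords, postwords):
    """Return a copy of tokens with seed-word phrases replaced by placeholders."""
    out = list(tokens)
    for i, word in enumerate(tokens):
        # Phrase of the form "<preword> <seed>", e.g. "first time".
        if word in prewords and i > 0 and tokens[i - 1] in prewords[word]:
            out[i - 1] = out[i] = PLACEHOLDER
        # Phrase of the form "<seed> <postword>", e.g. "long story".
        elif word in postwords and i + 1 < len(tokens) and tokens[i + 1] in postwords[word]:
            out[i] = out[i + 1] = PLACEHOLDER
    return out

# Reduced dictionaries, for illustration only.
prewords = {"time": ["first", "every"], "long": ["not"]}
postwords = {"time": [], "long": ["story", "hair"]}

tokens = "time passed slowly the first time , long story".split()
result = mask_two_word_phrases(tokens, prewords, postwords)
print(result)
# ['time', 'passed', 'slowly', 'the', 'PLACEHOLDER', 'PLACEHOLDER', ',', 'PLACEHOLDER', 'PLACEHOLDER']
```

The sample deliberately starts and ends with a seed-word phrase position, so both boundary checks are exercised.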


In [29]:
#save
df.to_pickle("processed_data_2.pkl")