<h1> Pre-processing 2 </h1>

This step involves:
- Removing reports not written in English
- Lowercasing and removing punctuation
- Replacing named entities (for anonymity) and dates with placeholders
- Tokenization
- Removing certain 2-3 word time phrases

In [None]:
#imports
import pandas as pd
import pickle
import re
import nltk
import nltk.data
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from collections import Counter
import spacy
from langdetect import detect

In [19]:
#import data after pre-processing 1
df = pd.read_excel("processed_data_1.xlsx")

<h3> Remove reports not written in English </h3>

In [20]:
#identify index of non-english reports (collected first, so the frame is not modified while iterating)
non_english = [idx for idx, text in df.text.items() if detect(text) != "en"]

#remove those rows
df.drop(non_english, axis = 0, inplace = True)

#reset index
df.reset_index(drop=True, inplace=True)
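
`langdetect` is non-deterministic unless a seed is fixed (`DetectorFactory.seed = 0`), and `detect` raises an exception for text without usable features, such as an empty string. The sketch below shows one possible way to make the filter tolerant of such rows. `is_english` and `fake_detect` are hypothetical names that are not part of the original pipeline; the detector is passed in as `detect_fn` so the logic can be tried without the library installed.

```python
def is_english(text, detect_fn):
    """Return True only if detect_fn labels the text as English.

    detect_fn stands in for langdetect.detect (an assumption of this sketch).
    Empty or non-string input, and any detection error, count as "not English".
    """
    if not isinstance(text, str) or not text.strip():
        return False
    try:
        return detect_fn(text) == "en"
    except Exception:  # langdetect raises LangDetectException when it finds no features
        return False


def fake_detect(text):
    # Toy detector for illustration only: it fails on digit-only text.
    if text.strip().isdigit():
        raise ValueError("no features in text")
    return "en" if "the" in text.lower().split() else "de"


reports = ["The session felt endless", "12345", "", "Das war schön"]
kept = [r for r in reports if is_english(r, fake_detect)]
print(kept)
```

In the notebook, the same helper could be applied as `df = df[df.text.apply(lambda t: is_english(t, detect))].reset_index(drop=True)`, assuming that rows which cannot be classified should be dropped as well.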

<h3> Lowercasing & Punctuation </h3>

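Lowercasing makes it less likely that words at the beginning of a sentence (e.g. "Reconstituting and ...") are misidentified as named entities.

Remove punctuation, except full stops and commas. Tokens that are not purely alphabetic, such as numbers and contraction fragments like "n't" or "'s", are removed as well.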

In [21]:
def lowercasing_punctuation(string):
    tokens = nltk.word_tokenize(string)  # tokenize string to words
    # lowercase; keep only purely alphabetic tokens plus "." and ","
    tokens = [ tok.lower() for tok in tokens
               if tok.isalpha()
               or tok in [".", ","]
             ]
    return " ".join(tokens) # convert back to string

df.text = df.text.apply(lowercasing_punctuation)
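
To see what this filter keeps and drops, it can be applied to a hand-written token list. This is a minimal sketch: `keep_alpha_and_stops` is a hypothetical stand-in for the filtering step only, and the token list approximates how a Treebank-style tokenizer splits the sentence rather than calling NLTK.

```python
def keep_alpha_and_stops(tokens):
    """Lowercase tokens; keep purely alphabetic ones plus "." and ","."""
    return " ".join(t.lower() for t in tokens if t.isalpha() or t in (".", ","))


# Roughly how "Don't worry, it's 5 pm." is split into tokens (written by hand here)
tokens = ["Do", "n't", "worry", ",", "it", "'s", "5", "pm", "."]
cleaned = keep_alpha_and_stops(tokens)
print(cleaned)  # do worry , it pm .
```

The contraction fragments and the number disappear, which is worth keeping in mind when reading the processed reports.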

<h3> Anonymize named entities </h3>

For anonymity, replace the following named entities with their entity type.

- PERSON (Names of people)
- ORG (Organisations)
- GPE (Geopolitical entity)
- LOC (Location)
- DATE (Dates and time periods)

DATE is included because expressions such as '5th of February' or 'last week' often occur in sentences that contain a time seed word but tend not to concern the perception of time duration. Replacing them with the placeholder 'DATE' makes these sentences easier to identify and exclude from the BERTopic analyses.



In [None]:
remove_list=[] #original text of every replaced token

nlp = spacy.load('en_core_web_lg')

entity_types = ["PERSON", "ORG", "GPE", "LOC", "DATE"]

for i, text in enumerate(df.text):
    print(i) #progress indicator
    doc = nlp(text)
    # Replace each token that belongs to a named entity with its entity type
    # (the seed word "time" is always kept, even inside a DATE entity)
    new_tokens = []
    for token in doc:
        if token.ent_type_ in entity_types and token.text != "time":
            remove_list.append(token.text)
            new_tokens.append(token.ent_type_)
        else:
            new_tokens.append(token.text)

    # Join the new tokens into a single string
    df.loc[i, "text"] = " ".join(new_tokens)
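
Because the replacement works token by token, a multi-token entity such as 'last week' becomes 'DATE DATE'. If a single placeholder per entity is preferred, consecutive identical labels can be collapsed. The following is a minimal sketch, assuming the text is already available as (token, entity type) pairs; `anonymize_pairs` and the example pairs are hypothetical, and the sketch does not call spaCy.

```python
from itertools import groupby

ENTITY_TYPES = {"PERSON", "ORG", "GPE", "LOC", "DATE"}


def anonymize_pairs(pairs, collapse=True):
    """Replace tokens whose entity type is in ENTITY_TYPES by that type.

    pairs: iterable of (token_text, entity_type) tuples, with "" for no entity.
    collapse: if True, a run of the same label becomes a single label.
    """
    out = [ent if ent in ENTITY_TYPES and tok != "time" else tok
           for tok, ent in pairs]
    if not collapse:
        return " ".join(out)
    collapsed = []
    for value, run in groupby(out):
        run = list(run)
        # Only runs of placeholder labels are collapsed, never ordinary repeated words.
        collapsed.extend([value] if value in ENTITY_TYPES else run)
    return " ".join(collapsed)


pairs = [("i", ""), ("met", ""), ("john", "PERSON"), ("smith", "PERSON"),
         ("last", "DATE"), ("week", "DATE"), (".", "")]
per_token = anonymize_pairs(pairs, collapse=False)
per_entity = anonymize_pairs(pairs)
print(per_token)   # i met PERSON PERSON DATE DATE .
print(per_entity)  # i met PERSON DATE .
```

One trade-off of collapsing: two distinct, directly adjacent entities of the same type would also be merged into one label.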

<h3> Tokenization </h3>

In [23]:
def tokenize_words(string):
    data = nltk.word_tokenize(string)  # tokenize string to words
    return data

df.text = df.text.loc[:].apply(tokenize_words)

<h3> Remove specific 2-3 word phrases containing time words </h3>

Replaces phrases such as 'first time', 'point in time', or 'long story' with a placeholder, as these phrases describe points in time or are idioms unrelated to time perception.

<br>

**Note**: Whether to include sentiment-based time phrases ('best time', 'wonderful time') is open to debate.

In [24]:
#this is an overview of all the removed time words together (some don't make sense because of the three-word combination phrases)
['first time', 'second time', 'third time', 'fourth time', 'fifth time', 'sixth time', 'seventh time', 'eigth time', 'nineth time', 'tenth time', 'whole time', 'this time', 'every time', 'each time', 'that time', 'next time', 'one time', 'good time', 'bad time', 'same time', 'last time', 'hard time', 'great time', 'entire time', 'some time', 'current time', 'single time', 'my time', 'winter time', 'summer time', 'spring time', 'dinner time', 'wonderful time', 'night time', 'right time', 'have time', 'had time', 'best time', 'awful time', 'worst time', 'free time', 'quality time', 'another time', 'popular time', 'in short', 'stopped short', 'running short', 'not long', 'to rate', 'I rate', 'would rate', 'any rate', 'week period', 'year period', 'in length', 'time first', 'time second', 'time third', 'time fourth', 'time fifth', 'time sixth', 'time seventh', 'time eigth', 'time nineth', 'time tenth', 'time whole', 'time this', 'time every', 'time each', 'time that', 'time next', 'time one', 'time good', 'time bad', 'time same', 'time last', 'time hard', 'time great', 'time entire', 'time some', 'time current', 'time single', 'time my', 'time winter', 'time summer', 'time spring', 'time dinner', 'time wonderful', 'time night', 'time right', 'time have', 'time had', 'time best', 'time awful', 'time worst', 'time free', 'time quality', 'time another', 'time popular', 'short in', 'short stopped', 'short running', 'long not', 'rate to', 'rate I', 'rate would', 'rate any', 'length in', 'period week', 'period year', 'second', 'seconds', 'minute', 'minutes', 'hour', 'hours', 'day', 'days', 'week', 'weeks', 'weekend', 'weekends', 'month', 'months', 'year', 'years', 'times', 'spent', 'point in time', 'point the time', 'point such time', 'point a time', 'point only time', 'point was time', 'point my time', 'point about time', 'point any time', 'by in time', 'by the time', 'by such time', 'by a time', 'by only time', 'by was time', 'by my time', 'by about time', 'by any time', 
'around in time', 'around the time', 'around such time', 'around a time', 'around only time', 'around was time', 'around my time', 'around about time', 'around any time', 'at in time', 'at the time', 'at such time', 'at a time', 'at only time', 'at was time', 'at my time', 'at about time', 'at any time', 'for in time', 'for the time', 'for such time', 'for a time', 'for only time', 'for was time', 'for my time', 'for about time', 'for any time', 'until in time', 'until the time', 'until such time', 'until a time', 'until only time', 'until was time', 'until my time', 'until about time', 'until any time', 'the in time', 'the the time', 'the such time', 'the a time', 'the only time', 'the was time', 'the my time', 'the about time', 'the any time', 'it in time', 'it the time', 'it such time', 'it a time', 'it only time', 'it was time', 'it my time', 'it about time', 'it any time', 'when in time', 'when the time', 'when such time', 'when a time', 'when only time', 'when was time', 'when my time', 'when about time', 'when any time', 'was in time', 'was the time', 'was such time', 'was a time', 'was only time', 'was was time', 'was my time', 'was about time', 'was any time', 'pass in time', 'pass the time', 'pass such time', 'pass a time', 'pass only time', 'pass was time', 'pass my time', 'pass about time', 'pass any time', 'all in time', 'all the time', 'all such time', 'all a time', 'all only time', 'all was time', 'all my time', 'all about time', 'all any time', 'from time to time', 'in a long time', 'time to time', 'one more time', 'waste of time', 'as long as', 'as quickly as', 'fast and furious', 'period of years', 'my period of', 'at this rate', 'length of this', 'feet long', 'as fast as possible']


#two word phrases
preword_dict = {"time":  ["first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "ninth", "tenth", "whole", "this", "every", "each", "that", "next", "one", "good", 'bad', "same", "last", "hard", "great", "entire", "some", "current", "single", "my", "winter", "summer", "spring", "dinner", "wonderful", "night", 'right', 'have', 'had', "best", 'awful', 'worst', "free", "quality", 'another', 'popular'],
                "short": ["in", "stopped", 'running'],
                'long':  ["not"],
                'rate':  ["to", "i", "would", "any"],  #text is lowercased at this point, so "i" rather than "I"
                'length':["in"],
                'period':  ["week", "year"] 
                }

postword_dict = {
                "time" : [],
                "short": ["story", "sentences", 'phrase', 'version', 'versions'],
                'long':  ["story", "pants", "after", "gone", "before", 'walk', 'walks', 'hair'],
                'rate':  [],
                'period':  ["to", "i", "would", "any"],  #text is lowercased at this point, so "i" rather than "I"
                'length':  ["in"]
                }

#three_word_phrases combinations pre time
one_word_pre_time = ["in", "the", "such", "a", "only", "was", 'my', 'about', 'any']
two_words_pre_time = ["point", "by", "around", "at", "for", "until", "the", "it", "when", 'was', 'pass', 'all']


#replace with "PLACEHOLDER"
def tok(text, j):
    #token at position j, or "" before the start of the text (avoids negative indices wrapping around to the end)
    return text[j] if j >= 0 else ""

for text in df.text:
    #the last three tokens are skipped so that the look-ahead (i+1, i+2) stays in range
    for i, word in enumerate(text[:-3]):
        p1, p2, p3 = tok(text, i-1), tok(text, i-2), tok(text, i-3)  #preceding tokens
        n1, n2 = text[i+1], text[i+2]                                #following tokens
        if word in preword_dict and p1 in preword_dict[word]:
            text[i] = "PLACEHOLDER"
        elif word in postword_dict and n1 in postword_dict[word]:
            text[i] = "PLACEHOLDER" #removed later - useful for indexing
        elif word == "time" and p1 in one_word_pre_time and p2 in two_words_pre_time:
            text[i] = "PLACEHOLDER"
        elif word == "time" and p1 == "from" and n1 == "to" and n2 == "time":  #from time to time
            text[i] = "PLACEHOLDER"
        elif word == "time" and p3 == "in" and p2 == "a" and p1 == "long":  #in a long time
            text[i] = "PLACEHOLDER"
        elif word == "time" and n1 == "to" and n2 == "time":  #first "time" in "time to time"
            text[i] = "PLACEHOLDER"
        elif word == "time" and p2 in ("time", "PLACEHOLDER") and p1 == "to":  #second "time"; the first has already been replaced
            text[i] = "PLACEHOLDER"
        elif word == "time" and p2 == "one" and p1 == "more":
            text[i] = "PLACEHOLDER"
        elif word == "time" and p2 == "waste" and p1 == "of":
            text[i] = "PLACEHOLDER"
        elif word == "long" and p1 == "as" and n1 == "as":
            text[i] = "PLACEHOLDER"
        elif word == "quickly" and p1 == "as" and n1 == "as":
            text[i] = "PLACEHOLDER"
        elif word == "fast" and n1 == "and" and n2 == "furious":
            text[i] = "PLACEHOLDER"
        elif word == "period" and n1 == "of" and n2 == "years":
            text[i] = "PLACEHOLDER"
        elif word == "period" and p1 == "my" and n1 == "of":  #my period of
            text[i] = "PLACEHOLDER"
        elif word == "rate" and p2 == "at" and p1 == "this":
            text[i] = "PLACEHOLDER"
        elif word == "length" and n1 == "of" and n2 == "this":
            text[i] = "PLACEHOLDER"
        elif word == "fast" and n1 == "as" and n2 == "possible":
            text[i] = "PLACEHOLDER"
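
The matching logic can be tried on a toy token list. The sketch below is a simplified, self-contained version that covers only the two dictionary-based rules (a listed word before or after the key word). `mask_phrases` and the small dictionaries are hypothetical and chosen for illustration; like the loop above, the sketch does not examine the last three tokens of a text.

```python
def mask_phrases(tokens, pre, post, placeholder="PLACEHOLDER"):
    """Return a copy of tokens in which key words are masked when a listed
    neighbour directly precedes (pre) or follows (post) them."""
    out = list(tokens)
    for i, word in enumerate(tokens[:-3]):
        before = out[i - 1] if i > 0 else ""  # no wraparound to the last token
        after = out[i + 1]
        if before in pre.get(word, ()) or after in post.get(word, ()):
            out[i] = placeholder
    return out


pre = {"time": ["first", "every"], "long": ["not"]}
post = {"long": ["story", "walk"], "short": ["story"]}

tokens = "time passed slowly the first time . long story short , it felt endless".split()
masked = mask_phrases(tokens, pre, post)
print(" ".join(masked))
```

Here the seed word in 'time passed slowly' is left untouched, while 'first time' and 'long story' are masked.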


In [25]:
#save
df.to_pickle("processed_data_2.pkl")