# Text Preprocessing
## Part 1: Basic Preprocessing
Here we will be looking at reducing the plots down into a workable list of tokens. The idea here is to simplify as much as possible whilst still keeping useful information about the plot.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Loading in data frame
data=pd.read_csv('C:\\Users\\Danie\\OneDrive\\Documents\\_Uni\\Maths\\Year 4\\Data Science Toolkit\\PreProcessedData.csv')
#Creating list of film plots as a list of strings
plots=data.Plot

We begin by performing basic edits to the text. These include lowercasing all words for simplicity and removing unnecessary punctuation. One problem we found was that hyphenated words such as 'ill-timed' were left as a single token rather than two words. To fix this, we split such words in two by replacing the hyphen with a space.

In [None]:
#BASIC TEXT EDITS
import re
#Switching out hyphens for spaces
plots=[str(plots[num]).replace('-',' ') for num in range(len(plots))]
# Removing punctuation
plots=[re.sub(r'[^\w\s]','', str(plots[num])) for num in range(len(plots))]
# Lowercasing the words
plots=[str(plots[num]).lower() for num in range(len(plots))]
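
To see the effect of these three edits on a single string, here is a minimal sketch. The helper name `basic_clean` and the sample sentence are our own illustrations and are not part of the dataset.

```python
import re

def basic_clean(text):
    """Hypothetical helper combining the three basic edits."""
    text = text.replace('-', ' ')          # 'ill-timed' -> 'ill timed'
    text = re.sub(r'[^\w\s]', '', text)    # strip remaining punctuation
    return text.lower()                    # lowercase everything

cleaned = basic_clean("An ill-timed visit; Julia's plan FAILS!")
print(cleaned)
```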

Next we remove all words that we know will be irrelevant in the final model. These include prepositions such as 'above, behind, with', predeterminers such as 'both, many', pronouns, and so on. Fortunately, in natural language processing these are recognised as 'stopwords'. We can simply import a pre-made list of English stopwords and remove them from our text. In the same step we 'tokenize' our text: turning it into a list of the words that appear. (Note: tokenizing on its own keeps repeated words; duplicates are removed later, when the functions are combined.)

In [None]:
#REMOVING STOP WORDS AND TOKENIZING
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#Define english stop words
stop_words = set(stopwords.words('english'))

#Define function for removing stop words + tokenizing
def stopntokenize(text):
    word_tokens = word_tokenize(text)
    new_text = []
    for w in word_tokens:
        if w not in stop_words:
            new_text.append(w)
    return(new_text)
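
The filtering idea can be sketched without any downloads, assuming a tiny hand-picked stop list and `str.split()` as simplified stand-ins for nltk's stopword list and `word_tokenize`. The names `demo_stop_words` and `filter_stop_words` are hypothetical. Note that repeated words survive this step.

```python
# A tiny hand-picked stop list; nltk's English list is much longer
demo_stop_words = {'the', 'a', 'with', 'her', 'and', 'to'}

def filter_stop_words(text, stop_words):
    # str.split() is a simplified stand-in for nltk's word_tokenize
    return [w for w in text.split() if w not in stop_words]

kept = filter_stop_words('the bride and the groom dance with the bride',
                         demo_stop_words)
print(kept)  # 'bride' appears twice: filtering does not remove duplicates
```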
    

Whilst removing stopwords does a lot to remove unnecessary words, we can go one step further and remove anything that isn't an adjective, verb or noun. (We presume here that adverbs will be irrelevant to our final model, as they rarely seem to provide a unique description of the text.) To do this we use nltk's built-in part-of-speech tagger, pos_tag, to classify each word.

In [None]:
#WORDREMOVER

#Defining function for removing any word not a Noun, Adjective or Verb
def wordremover(tokens):
    wordtypes = nltk.pos_tag(tokens)
    tokens_new=[]
    for i in range(len(tokens)):
        if wordtypes[i][1] in ['NN','NNP','JJ','JJR','JJS','VB','VBD','VBG','VBN','VBP','VBZ']:
            tokens_new.append(tokens[i])
    return(tokens_new)

Now that we have only our most important words, the next step is to reduce them to their simplest form, or 'stem'. For instance, we would like 'drink', 'drinks' and 'drinking' to all be shortened to 'drink'. Whilst dedicated functions exist for stemming and lemmatizing, such as PorterStemmer in nltk, we found that these tended to over- or under-stem anything other than verbs. WordNet's built-in function 'morphy' seemed to work best, reducing words down without many errors.

In [None]:
#LEMMATIZING AND STEMMING
from nltk.corpus import wordnet

#Defining function to reduce each word down to its simplest form
def lemmanstem(tokens):
    new_tokens=[]
    for token in tokens:
        #morphy returns None when it cannot find a base form
        base=wordnet.morphy(token)
        #Keep the original token if no base form was found
        new_tokens.append(token if base is None else base)
    return(new_tokens)

In [None]:
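#NONE REMOVER

#Defining function to remove Nones from a list of tokens in place
def noneremover(tokens):
    #Rebuilding the list avoids skipping items or running past the end,
    #which can happen when removing items during an index-based loop
    tokens[:]=[token for token in tokens if token is not None]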
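
Filtering a list in place can be sketched with slice assignment, which sidesteps the index shifting that occurs when items are removed during a loop over the list's indices. The function name and sample tokens below are hypothetical.

```python
def remove_nones_in_place(tokens):
    # Slice assignment replaces the contents of the same list object,
    # so callers holding a reference to the list see the change
    tokens[:] = [t for t in tokens if t is not None]

demo_tokens = ['wed', None, 'friend', None, None, 'plan']
remove_nones_in_place(demo_tokens)
print(demo_tokens)
```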

Now that we have all our functions defined, we can combine them all into one master function, which we use to turn the text as a string into a list of simplified tokens to use in our topic model.

In [None]:
#STRING TO TOKENS
def tokenizer(text):
    token=stopntokenize(text)
    token=wordremover(token)
    token=lemmanstem(token)
    token=list(dict.fromkeys(token))
    noneremover(token)
    return(token)
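
The duplicate removal inside the master function relies on `dict.fromkeys`, which keeps the first occurrence of each token and preserves order. A small sketch with made-up tokens (the helper name is hypothetical):

```python
def dedupe_keep_order(tokens):
    # dict keys are unique and remember insertion order,
    # so converting back to a list drops later repeats only
    return list(dict.fromkeys(tokens))

deduped = dedupe_keep_order(['love', 'wed', 'love', 'friend', 'wed'])
print(deduped)
```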

To check how our tokenizer is working, let's try a short plot as an example, the plot to 'My Best Friend's Wedding'. On inspection this seems to be working well. All the stop words are gone, all the verbs have been reduced to their most basic form, and all the nouns are in a singular form.

In [None]:
example=plots[4]
example

In [None]:
tokenizer(example)

This seems to work well in reducing the text to a few words, but let's see how it performs on average across the dataset. On average, our tokenizer reduces each plot by about 93% (comparing the number of tokens with the length of the original plot), which is a massive reduction. In absolute terms, each plot is reduced to about 40 words. This seems like a workable amount whilst still having the potential to include all relevant information about the plot.

In [None]:
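#WORD REDUCTION 

#Calculating the average ratio of token count to plot length
#Note: len(plots[i]) is the length of the string in characters;
#use len(plots[i].split()) instead to compare against the number of words
x=[]
for i in range(4000):
    x.append(len(tokenizer(plots[i]))/len(plots[i]))

#Average ratio (the reduction is one minus this value)
sum(x)/4000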
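
When measuring a reduction like this, it matters whether the original length is counted in characters or in words. The sketch below computes both for a single made-up plot; the function name, sample text and sample tokens are hypothetical.

```python
def reduction_stats(text, tokens):
    n_words = len(text.split())   # whitespace-separated words
    n_chars = len(text)           # characters, including spaces
    return {
        'vs_words': 1 - len(tokens) / n_words,
        'vs_chars': 1 - len(tokens) / n_chars,
    }

demo_plot = 'a young woman travels to the city and finds a new life'
demo_tokens = ['young', 'woman', 'travel', 'city', 'find', 'new', 'life']
stats = reduction_stats(demo_plot, demo_tokens)
print(stats)
```

The character-based figure is always the larger of the two, since a plot has more characters than words.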

In [None]:
#DATA SIZE

#Calculating average number of words per tokenized plot
#i.e. number of words remaining after text preprocessing
x=[]
for i in range(4000):
    x.append(len(tokenizer(plots[i])))

sum(x)/4000

Now that we are confident in the ability of our text preprocessing, we can apply our tokenizer to every plot in our dataset. This gives us the dataset that we will be using for the topic models in the following sections.

In [None]:
#CREATING DATASET OF TOKENIZED PLOTS
token_plots=[]
for i in range(len(plots)):
    token_plots.append(tokenizer(plots[i]))

To use this list of lists in other Python files, we convert each list into a string and save the result to a .csv file. It can then be converted back into a list of lists using the commented-out code below.

In [None]:
stringed_plots=[str(token_plots[num]) for num in range(len(token_plots))]
df=pd.DataFrame(stringed_plots)
df.to_csv('stringed_plots.csv')

In [None]:
#CODE FOR CONVERTING DATAFRAME OF STRINGS INTO LIST OF LISTS OF TOKENS
#stringed_data=pd.read_csv('stringed_plots.csv')
#tokenized_plots=[]
#for i in range(0,len(stringed_data)):
#    data=np.array(stringed_data.iloc[i])
#    text=data[1]
#    text=text.replace(',','')
#    text=text.replace('[','')
#    text=text.replace(']','')
#    text=text.replace("'",'')
#    tokens=word_tokenize(text)
#    tokenized_plots.append(tokens)
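
An alternative way to recover the lists, sketched here as an assumption rather than the method used in this notebook, is to parse each saved string with `ast.literal_eval`. To keep the example self-contained it uses the standard-library `csv` module and an in-memory buffer instead of pandas and a file on disk; the sample tokens are made up.

```python
import ast
import csv
import io

demo_token_plots = [['wed', 'friend'], ["o'brien", 'plan']]

# Write each token list as its string representation, one per row
buffer = io.StringIO()
writer = csv.writer(buffer)
for tokens in demo_token_plots:
    writer.writerow([str(tokens)])

# Read the rows back and parse each string into a list again
buffer.seek(0)
restored = [ast.literal_eval(row[0]) for row in csv.reader(buffer)]
print(restored)
```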

## Part 2: Synonym Extension

This is a good point to stop with the preprocessing, although we can potentially go one step further and look at the effect of synonyms. One potential issue when creating the topic model is that we remove words which do not appear often enough in the dictionary. The problem here is that this can remove rarer synonyms of a more common word, even though both could be significant to the topic. For instance, 'conflict, war, battle, skirmish' all mean pretty much the same thing, and we would expect them to be significant to our model if they appear in a plot. However, out of these 'war' is probably going to appear a lot more than the other three, so the other three could be unjustly removed. It would be beneficial if we could reduce 'conflict, battle, skirmish' to 'war', which would essentially save these words from being removed.

Hence we need a function that takes a word, checks all its potential synonyms, and then picks the one most likely to be kept after removing extremes in our dictionary.

To start, we need to create a ranking of all words in the tokenized plots based off how much they are repeated throughout our dataset. This will come in useful when defining our synonym function. We start by creating a dictionary of all words, and then a dictionary with extremes removed.

In [None]:
#CREATING DICTIONARY
import gensim

dictionary = gensim.corpora.Dictionary(token_plots)
dictionary_trimmed = gensim.corpora.Dictionary(token_plots)

#We remove words which come up too often or not often enough
dictionary_trimmed.filter_extremes(no_below=25, no_above=0.5, keep_n=100000)
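
To make the effect of `no_below` and `no_above` concrete, here is a plain-Python approximation of the filtering rule: keep a word only if it appears in at least `no_below` plots and in no more than a fraction `no_above` of all plots. This is a sketch of the idea, not gensim's implementation; it ignores `keep_n`, and the function name and sample plots are hypothetical.

```python
from collections import Counter

def filter_extremes_sketch(token_lists, no_below, no_above):
    n_docs = len(token_lists)
    # Document frequency: number of token lists containing each word
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    return {w for w, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs}

demo_plots = [['war', 'love'], ['war', 'spy'],
              ['war', 'love', 'heist'], ['war', 'robot']]
# 'war' is in every plot (too common); 'spy', 'heist', 'robot' are too rare
kept_words = filter_extremes_sketch(demo_plots, no_below=2, no_above=0.75)
print(kept_words)
```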

We sort this dictionary by its counts, and translate this into a list of ranked words. We can see here the top ten most popular words across all plots, with the top three being 'life, find, new'.

In [None]:
#FINDING TOP WORDS

#Creating dictionary of wordcounts
new_dict=dictionary.cfs
new_dict_trimmed=dictionary_trimmed.cfs
#Sorting by wordcount
new_dict2 = sorted(new_dict.items(), key=lambda x:x[1],reverse=True)
new_dict2_trimmed = sorted(new_dict_trimmed.items(), key=lambda x:x[1],reverse=True)
#Translating word index to word
ranked_words=[]
for i in range(len(new_dict2)):
    ranked_words.append(dictionary[new_dict2[i][0]])
ranked_words_trimmed=[]
for i in range(len(new_dict2_trimmed)):
    ranked_words_trimmed.append(dictionary_trimmed[new_dict2_trimmed[i][0]])
#Top 10 words
ranked_words[:10]
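
The ranking step can be illustrated with the standard library alone. The sketch below counts words with `collections.Counter` on made-up plots and also builds a word-to-rank dictionary, which is our own suggestion for faster lookups than repeatedly calling `list.index`.

```python
from collections import Counter

demo_plots = [['life', 'find', 'war'], ['life', 'find'], ['life', 'robot']]

# Count how often each word occurs across all plots
counts = Counter(word for tokens in demo_plots for word in tokens)
# most_common() sorts by count, highest first
ranked = [word for word, _ in counts.most_common()]
# Map each word to its position in the ranking (0 = most frequent)
rank_of = {word: rank for rank, word in enumerate(ranked)}
print(ranked[:2], rank_of['life'])
```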

One issue to address first is that WordNet has multiple definitions for each word, so we need to make sure we are using the most common one. nltk already has the built-in function pos_tag, although this is not always reliable. It often decides a word is a noun by default if it can be one, even if it is rarely used as such. For instance, most people would agree that 'jump' is most commonly used as a verb, but since it can be used as a noun, pos_tag says it is one. This leads us to define our own.

In [None]:
nltk.pos_tag(['jump'])

There is no guaranteed way to say what type of word a token is without referring to its context within the original text. Instead we use the word's most common definition. One way of estimating this is to count up every possible definition of a word, then pick the word type which is referred to the most. For instance, there are many slightly different definitions of the word 'jump'. However, by looking at all possibilities, we can see that it is by far most often used as a verb, so we would assign it as such.

In [None]:
wordnet.synsets('jump')

In [None]:
#DEFINING OUR OWN POS_TAG
def poz_tag(word):
    #Count up the amounts of each type of word definition
    #Wordnet uses 'n' for nouns, 'v' for verbs, 'a' for adjectives,
    #'s' for satellite adjectives, and 'r' for adverbs
    verbcount=len(wordnet.synsets(word,'v'))
    satcount=len(wordnet.synsets(word,'s'))
    nouncount=len(wordnet.synsets(word,'n'))
    adjcount=len(wordnet.synsets(word,'a'))
    adverbcount=len(wordnet.synsets(word,'r'))
    #Find maximum count
    wordtype=max(verbcount,nouncount,adjcount,satcount,adverbcount)
    #Return the most popular type of word
    #(If the maximum is shared by multiple, we return the type of word in this order, arbitrarily)
    if wordtype==nouncount:
        return('n')
    elif wordtype==verbcount:
        return('v')
    elif wordtype==adjcount:
        return('a')
    elif wordtype==satcount:
        return('s')
    elif wordtype==adverbcount:
        return('r')
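
The tie-breaking rule can be sketched independently of WordNet. In the example below the counts are hypothetical numbers chosen for illustration, not real synset counts, and `most_common_pos` is a name of our own.

```python
def most_common_pos(counts, priority=('n', 'v', 'a', 's', 'r')):
    # Largest count among the word types we consider
    best = max(counts.get(pos, 0) for pos in priority)
    # On a tie, the first type in the priority order wins
    for pos in priority:
        if counts.get(pos, 0) == best:
            return pos

# Hypothetical counts, not taken from WordNet
print(most_common_pos({'n': 2, 'v': 5}))   # clear winner
print(most_common_pos({'n': 3, 'v': 3}))   # tie resolved by priority order
```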

Now we have dealt with this problem, we are ready to define our word to synonym transformer. We begin by creating a list of synonyms for the word which we can do using in-built functions in wordnet. We make sure however we restrict the search to only synonyms of the type of word poz_tag gives us. Once we have a selection of synonyms to choose from, we select the 'best' one. As a metric for this, we first look at the synonyms that are in our ranked words, i.e. the ones that would be saved from potential trimming of the data. Out of these we choose the most 'significant', using the index of the ranked words as a metric. (Note that it is possible that multiple words might be mapped to the same synonym, so we remove duplicates)

In [None]:
#Defining function to create a list of synonyms for a word
def synonyms(word):
    list_of_synonyms= []
    
    #For each synset of the word, but only those under our poz_tag
    for syn in wordnet.synsets(word,poz_tag(word)):
        for l in syn.lemmas():
            list_of_synonyms.append(l.name())
    list_of_synonyms=list(dict.fromkeys(list_of_synonyms))


    return(list_of_synonyms)

#Defining function to map a word to its best synonym
def wordtosynonym(word):
    #Creating the word's list of synonyms
    word_synonyms=synonyms(word)
    master_word=word
    #Ranking of the best word found so far, based off its count in the dictionary
    best_rank = ranked_words.index(word)
    for i in word_synonyms:
        #Only turn the word into a synonym if the synonym is in the trimmed dictionary
        if i in ranked_words_trimmed:      
            sig2 = ranked_words.index(i)
            #We set this synonym as the 'best synonym' if it ranks above the best found so far
            if sig2<best_rank:
                master_word=i
                best_rank=sig2
    return(master_word)

#Fore supplies the colour codes used when printing (assumed here to come from the colorama package)
from colorama import Fore

#Function applying wordtosynonym to every word in a list of tokens
def tokenstosynonyms(tokens,example):
    new_tokens=[]
    for word in tokens:
        new_word=wordtosynonym(word)
        if example:
            if new_word==word:
                print(Fore.BLUE+word,Fore.BLUE+new_word)
            else:
                print(Fore.BLUE+word,Fore.RED+new_word)
        new_tokens.append(new_word)
    new_tokens=list(dict.fromkeys(new_tokens))
    return(new_tokens)
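
The selection rule described above (among the synonyms that survive trimming, pick the highest ranked one, otherwise keep the word itself) can be sketched with plain dictionaries. Everything here is hypothetical: the ranks, the kept set and the candidate lists are made up, and `best_synonym` is not a function from this notebook.

```python
def best_synonym(word, candidates, rank_of, kept_words):
    best_word = word
    best_rank = rank_of[word]
    for candidate in candidates:
        # Only consider synonyms that survive trimming, and only
        # switch when the candidate beats the best rank seen so far
        if candidate in kept_words and rank_of[candidate] < best_rank:
            best_word, best_rank = candidate, rank_of[candidate]
    return best_word

demo_rank = {'war': 0, 'battle': 1, 'conflict': 2, 'skirmish': 3}
demo_kept = {'war', 'battle'}

chosen = best_synonym('skirmish', ['battle', 'war', 'conflict'],
                      demo_rank, demo_kept)
print(chosen)
```

Tracking the best rank seen so far makes the result independent of the order in which the synonyms are listed.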

We can see how this works using the plot of the film 'Up at the Villa' as an example. Here, for every word, we have the original word on the left and its 'optimal synonym' on the right. If the synonym differs from the original, it is displayed in red.

In [None]:
tokenstosynonyms(token_plots[50],True)

To see the effect of this, let us consider what happens when we remove from the tokenized plots the words that were trimmed from our dictionary. This is something we expect to happen in the topic model sections later.

In [None]:
synonymed_plots=[]
for i in range(1000):
    synonymed_plots.append(tokenstosynonyms(token_plots[i],False))

In [None]:
#Defining a function that removes the words we defined as 'extreme'
def wordtrimmer(tokens,rank):
    new_tokens=[]
    for word in tokens:
        if word in rank:
            new_tokens.append(word)
    return(new_tokens)

Here we can see that the average number of words after trimming is slightly higher for the synonymed plots, meaning we have been able to keep extra words whilst still retaining effectively the same meaning.

In [None]:
print(sum([len(wordtrimmer(token,ranked_words_trimmed)) for token in synonymed_plots])/1000)
print(sum([len(wordtrimmer(token_plots[num],ranked_words_trimmed)) for num in range(1000)])/1000)

As a specific example, let's consider the plot of 'Wild Child'. Here we have the tokenized plot before and after having its extreme tokens removed, with the original in red and the synonymed version in blue. Firstly, the synonymed version has two fewer words than the original, most likely from merging words within the plot that were synonyms of each other. However, after trimming, the synonymed version actually has seven more words. It is difficult to see with the words written in a slightly different order, but among other changes the synonymed version has changed:

'regime' to 'government'

'dismiss' to 'fire'

'appeal' to 'attract'

all of which were originally removed but have now been saved from removal in the synonymed version.

In [None]:
M=401
#Original tokenized list
print(Fore.RED+str(token_plots[M]))
#Number of words in original list
print(len(token_plots[M]))
#Synonymed list
print(Fore.BLUE+str(synonymed_plots[M]))
#Number of words in synonymed list
print(len(synonymed_plots[M]))
#Original list after extreme words removed
print(Fore.RED+str(wordtrimmer(token_plots[M],ranked_words_trimmed)))
#Number of words remaining after trimming
print(len(wordtrimmer(token_plots[M],ranked_words_trimmed)))
#Synonymed list after extreme words removed
print(Fore.BLUE+str(wordtrimmer(synonymed_plots[M],ranked_words_trimmed)))
#Number of words remaining after trimming
print(len(wordtrimmer(synonymed_plots[M],ranked_words_trimmed)))