# Text Preprocessing

Here we will be looking at reducing the plots down into a workable list of tokens. The idea here is to simplify as much as possible whilst still keeping useful information about the plot.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Loading in data frame
data=pd.read_csv('C:\\Users\\Danie\\OneDrive\\Documents\\_Uni\\Maths\\Year 4\\Data Science Toolkit\\PreProcessedData.csv')
#Creating list of film plots as a list of strings
plots=data.Plot

We begin by performing basic edits to the text. These include lowercasing all words for simplicity and removing unnecessary punctuation. One problem we found was that hyphens inside words were simply deleted along with the other punctuation, which left two words represented as one: 'ill-timed' became 'illtimed'. To fix this, we split such words in two by replacing each hyphen with a space before removing the rest of the punctuation.

In [3]:
#BASIC TEXT EDITS
import re
#Switching out hyphens for spaces
plots=[str(plot).replace('-',' ') for plot in plots]
# Removing punctuation
plots=[re.sub(r'[^\w\s]','',plot) for plot in plots]
# Lowercasing the words
plots=[plot.lower() for plot in plots]
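
As a quick illustration of these three edits, the sketch below applies them to a single made-up sentence. The helper name `basic_clean` and the sample text are hypothetical and only there to show the effect of each step.

```python
import re

def basic_clean(text):
    # Hypothetical helper: the same three steps as above, applied to one string
    text = str(text).replace('-', ' ')      # hyphens become spaces
    text = re.sub(r'[^\w\s]', '', text)     # drop the remaining punctuation
    return text.lower()                     # lowercase everything

sample = "An ill-timed joke; he doesn't laugh!"
cleaned = basic_clean(sample)
print(cleaned)  # an ill timed joke he doesnt laugh
```

Note that the apostrophe in "doesn't" is removed too, which matters later when matching against the stopword list.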

Next we remove all words that we know will be irrelevant in the final model. These include prepositions such as 'above, behind, with', predeterminers such as 'both, many', pronouns, and so on. Fortunately, in natural language processing these are recognised as 'stopwords', so we can simply import a pre-made list of English stopwords and remove them from our text. To do this we first 'tokenize' the text, turning it into a list of the words that appear, and then drop every token that is in the stopword list. (Note: tokenizing does not remove duplicates by itself; repeated words are dropped later, in the master function.)

In [4]:
#REMOVING STOP WORDS AND TOKENIZING
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#Define english stop words
stop_words = set(stopwords.words('english'))

#Define function for removing stop words + tokenizing
def stopntokenize(text):
    word_tokens = word_tokenize(text)
    new_text = []
    for w in word_tokens:
        if w not in stop_words:
            new_text.append(w)
    return(new_text)
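
To see the filtering logic on its own, here is a minimal sketch that does not need nltk. It assumes a tiny hand-made stopword set and plain whitespace splitting as stand-ins for `stopwords.words('english')` and `word_tokenize`; the names `toy_stop_words` and `toy_stopntokenize` are hypothetical.

```python
# Toy stand-ins (assumptions): a tiny stopword set and whitespace splitting
toy_stop_words = {'a', 'the', 'to', 'her', 'she', 'is', 'in'}

def toy_stopntokenize(text, stop_words=toy_stop_words):
    # Split into tokens, then keep only the tokens that are not stopwords
    return [word for word in text.split() if word not in stop_words]

toy_tokens = toy_stopntokenize('she wants to marry her best friend her best friend')
print(toy_tokens)
# ['wants', 'marry', 'best', 'friend', 'best', 'friend']
```

The repeated 'best' and 'friend' show that duplicates survive this step.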
    

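Whilst removing stopwords does a lot to remove unnecessary words, we can go one step further and remove anything that isn't an adjective, verb or noun. (We presume here that adverbs will be irrelevant to our final model, as they rarely seem to provide a unique description of the text.) To do this we use nltk's built-in part-of-speech tagger, nltk.pos_tag, to classify each word, and keep only the words whose tag is in our list. Note that the list contains the singular noun tags 'NN' and 'NNP' but not the plural tags 'NNS' and 'NNPS', so words tagged as plural nouns are removed as well.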

In [5]:
#WORDREMOVER

#Defining function for removing any word not a Noun, Adjective or Verb
def wordremover(tokens):
    wordtypes = nltk.pos_tag(tokens)
    tokens_new=[]
    for i in range(len(tokens)):
        if wordtypes[i][1] in ['NN','NNP','JJ','JJR','JJS','VB','VBD','VBG','VBN','VBP','VBZ']:
            tokens_new.append(tokens[i])
    return(tokens_new)
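
The filtering step can be sketched without running the tagger. The (word, tag) pairs below are assigned by hand for illustration and are not real `nltk.pos_tag` output; `KEEP_TAGS` is a hypothetical name for the tag list used above.

```python
# Same tag list as in wordremover, stored as a set for quick lookup
KEEP_TAGS = {'NN', 'NNP', 'JJ', 'JJR', 'JJS',
             'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

# Hand-assigned tags (assumption), in the (word, tag) format pos_tag returns
tagged = [('woman', 'NN'), ('years', 'NNS'), ('earlier', 'RBR'),
          ('jealous', 'JJ'), ('finds', 'VBZ')]

kept = [word for word, tag in tagged if tag in KEEP_TAGS]
print(kept)  # ['woman', 'jealous', 'finds']
```

Here 'years' is dropped because 'NNS' is not in the list, and 'earlier' is dropped as an adverb.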

Now that we have only our most important words, the next step is to reduce them to their simplest form, or 'stem'. For instance, we would like the words 'drink', 'drinks' and 'drinking' all to be shortened to 'drink'. Whilst separate tools exist for stemming and lemmatizing, such as PorterStemmer in nltk, we found that these tended to over- or under-stem anything other than verbs. WordNet's built-in function 'morphy' seemed to work best, reducing most words without many errors. When morphy does not recognise a word it returns None, in which case we keep the original token.

In [6]:
#LEMMATIZING AND STEMMING
from nltk.corpus import wordnet

#Defining function to reduce each word down to its simplest form
def lemmanstem(tokens):
    new_tokens=[]
    for token in tokens:
        #morphy returns None if it cannot find a base form
        base=wordnet.morphy(token)
        if base is None:
            new_tokens.append(token)
        else:
            new_tokens.append(base)
    return(new_tokens)
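
The fallback behaviour (keep the original token when no base form is found) can be illustrated with a toy lookup. The dictionary below is made up and only stands in for `wordnet.morphy`; it is not how WordNet works internally.

```python
# Toy lookup standing in for wordnet.morphy (assumption): known words map to
# a base form, anything else gives None
toy_base_forms = {'drinks': 'drink', 'drinking': 'drink', 'drink': 'drink'}

def toy_lemmanstem(tokens):
    new_tokens = []
    for token in tokens:
        base = toy_base_forms.get(token)   # None when the word is unknown
        new_tokens.append(token if base is None else base)
    return new_tokens

reduced = toy_lemmanstem(['drinks', 'drinking', 'hogwarts'])
print(reduced)  # ['drink', 'drink', 'hogwarts']
```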

In [7]:
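#NONE REMOVER

#Defining function to remove Nones (in place)
def noneremover(tokens):
    #Rebuild the list contents in one step rather than removing items
    #while looping over indices, which would raise an IndexError
    tokens[:]=[token for token in tokens if token is not None]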
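
Removing items from a list while looping over it is easy to get wrong, because the list shrinks underneath the loop. The small sketch below, using made-up tokens, compares a loop that misses a None with a filter that rebuilds the list contents in place.

```python
# Removing while iterating: the second None is skipped
looped = ['a', None, None, 'b']
for token in looped:
    if token is None:
        looped.remove(token)
print(looped)    # ['a', None, 'b']

# Filtering in place with slice assignment: every None is removed
filtered = ['a', None, None, 'b']
filtered[:] = [token for token in filtered if token is not None]
print(filtered)  # ['a', 'b']
```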

Now that we have all our functions defined, we can combine them all into one master function, which we use to turn the text as a string into a list of simplified tokens to use in our topic model.

In [8]:
#STRING TO TOKENS
def tokenizer(text):
    token=stopntokenize(text)
    token=wordremover(token)
    token=lemmanstem(token)
    token=list(dict.fromkeys(token))
    noneremover(token)
    return(token)
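
The duplicate removal in the master function relies on `dict.fromkeys`, which keeps the first occurrence of each key in its original order. A minimal sketch with made-up tokens:

```python
tokens_with_repeats = ['marry', 'friend', 'marry', 'wedding', 'friend']

# Dictionary keys are unique and keep insertion order, so this removes
# later repeats while preserving the order of first appearance
unique_tokens = list(dict.fromkeys(tokens_with_repeats))
print(unique_tokens)  # ['marry', 'friend', 'wedding']
```

Using `set` instead would also remove duplicates, but would not keep the tokens in order.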

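To check how our tokenizer is working, let's try a short plot as an example: the plot of 'My Best Friend's Wedding'. On inspection this seems to be working well. The stop words are gone, the verbs have been reduced to their most basic form, and the remaining nouns are in singular form. The one exception is 'doesnt': because the apostrophe was stripped during the basic text edits, it no longer matches an entry in the stopword list and so survives as a token.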

In [44]:
example=plots[4]
example

'a woman who by a promise made years earlier is supposed to marry her best friend in three weeks even though she doesnt want to when she finds out that hes marrying someone else she becomes jealous and tries to break off the wedding'

In [41]:
tokenizer(example)

['woman',
 'promise',
 'make',
 'suppose',
 'marry',
 'best',
 'friend',
 'doesnt',
 'want',
 'find',
 'someone',
 'become',
 'jealous',
 'break',
 'wedding']

This seems to work well in reducing the text to a few words, but let's see how it performs on average across the first 4000 plots. The first calculation below gives an average ratio of about 0.07, which would suggest a reduction of about 93%. Note, however, that len(plots[i]) is the length of the plot string in characters rather than words, so this ratio is the number of tokens per character and overstates the reduction in word count. In absolute terms, each plot is reduced to about 39 tokens. This seems like a workable amount whilst still having the potential to include all relevant information about the plot.

In [11]:
#WORD REDUCTION 

#Calculating ratio of number of tokens to plot length for each plot
#(note: len(plots[i]) counts characters, not words)
x=[]
for i in range(4000):
    x.append(len(tokenizer(plots[i]))/len(plots[i]))

sum(x)/4000

0.07198069794144195
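
To compare word counts rather than characters, the plot can be split on whitespace before taking its length. The sketch below is one way to do this, using the example plot and its tokens from above; the helper name `reduction_ratios` is hypothetical.

```python
def reduction_ratios(plot_text, tokens):
    # Hypothetical helper: returns tokens per character and tokens per word
    per_character = len(tokens) / len(plot_text)
    per_word = len(tokens) / len(plot_text.split())
    return per_character, per_word

example_plot = ('a woman who by a promise made years earlier is supposed to '
                'marry her best friend in three weeks even though she doesnt '
                'want to when she finds out that hes marrying someone else '
                'she becomes jealous and tries to break off the wedding')
example_tokens = ['woman', 'promise', 'make', 'suppose', 'marry', 'best',
                  'friend', 'doesnt', 'want', 'find', 'someone', 'become',
                  'jealous', 'break', 'wedding']

per_character, per_word = reduction_ratios(example_plot, example_tokens)
print(per_character, per_word)
```

For this plot, 15 tokens remain out of 44 words, so the per-word ratio is noticeably larger than the per-character one.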

In [12]:
#DATA SIZE

#Calculating average number of words per tokenized plot
#i.e. number of words remaining after text preprocessing
x=[]
for i in range(4000):
    x.append(len(tokenizer(plots[i])))

sum(x)/4000

38.78775

Now that we are confident in the ability of our text preprocessing, we can apply our tokenizer to every plot in our dataset. This gives us the dataset that we will be using for the topic models in the following sections.

In [346]:
#CREATING DATASET OF TOKENIZED PLOTS
token_plots=[]
for i in range(len(plots)):
    token_plots.append(tokenizer(plots[i]))

To use this list of lists in other Python files, we convert each list into a string and save the strings to a .csv file. The commented-out code below can then be used to convert them back into lists of tokens.

In [90]:
stringed_plots=[str(token_plots[num]) for num in range(len(token_plots))]
df=pd.DataFrame(stringed_plots)
df.to_csv('stringed_plots.csv')

In [173]:
#CODE FOR CONVERTING DATAFRAME OF STRINGS INTO LIST OF LISTS OF TOKENS
#stringed_data=pd.read_csv('stringed_plots.csv')
#tokenized_plots=[]
#for i in range(0,len(stringed_data)):
#    data=np.array(stringed_data.iloc[i])
#    text=data[1]
#    text=text.replace(',','')
#    text=text.replace('[','')
#    text=text.replace(']','')
#    text=text.replace("'",'')
#    tokens=word_tokenize(text)
#    tokenized_plots.append(tokens)
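
A possibly simpler way to reverse `str(list)` is `ast.literal_eval`, which parses the string back into a Python list directly. The sketch below is only an illustration under that assumption: it uses the standard-library `csv` module and an in-memory buffer instead of pandas and a real file, with made-up tokens.

```python
import ast
import csv
import io

toy_token_plots = [['woman', 'promise', 'marry'], ['money', 'pound']]

# Write each token list as a single string field, as str(list) would give
buffer = io.StringIO()
writer = csv.writer(buffer)
for tokens in toy_token_plots:
    writer.writerow([str(tokens)])

# Read the strings back and parse each one into a list of tokens
buffer.seek(0)
restored = [ast.literal_eval(row[0]) for row in csv.reader(buffer)]
print(restored == toy_token_plots)  # True
```

The same `ast.literal_eval` call should work on each string in the column read back by pandas.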

This is a good point to stop with the preprocessing, although we can potentially go one step further and look at the effect of synonyms. One potential issue when creating the topic model is removing words which do not appear too often. The problem here is that this potentially removes rarer synonyms of a more common word, whilst both could potentially be significant to the topic. For instance, 'conflict, war, battle, skirmish' all mean pretty much the same thing, and we would expect them to be significant to our model if they appear in a plot. However out of these 'war' is probably going to appear a lot more than the other three, so it could be that the other three are unjustly removed. It would be beneficial to us if we could reduce 'conflict, battle, skirmish' to 'war', which would not only mean they are kept in the data, but also reduce the size of it.

To start, we need to create a ranking of all words in the tokenized plots based off how much they are repeated throughout our dataset. This will come in useful when defining our synonym function.

In [46]:
#CREATING DICTIONARY
import gensim

dictionary = gensim.corpora.Dictionary(token_plots)
#Number of unique tokens in the dictionary
len(dictionary)

24569

We sort this dictionary by its counts, and translate this into a list of ranked words. We can see here the top ten most popular words across all plots, with the top three being 'life, find, new'.

In [15]:
#FINDING TOP WORDS

#Creating dictionary of wordcounts
new_dict=dictionary.cfs
#Sorting by wordcount
new_dict2 = sorted(new_dict.items(), key=lambda x:x[1],reverse=True)
#Translating word index to word
ranked_words=[]
for i in range(len(new_dict2)):
    ranked_words.append(dictionary[new_dict2[i][0]])
#Top 10 words
ranked_words[:10]

['life', 'find', 'new', 'take', 'get', 'young', 'family', 'world', 'go', 'man']
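
The same kind of ranking can be sketched with `collections.Counter` on a toy dataset. This is a stand-in for the gensim dictionary rather than a replacement for it, and the names `toy_token_plots`, `toy_ranked_words` and `toy_rank` are hypothetical. Storing each word's position in a dict also makes rank lookups quick.

```python
from collections import Counter

toy_token_plots = [['war', 'family'],
                   ['war', 'battle'],
                   ['war', 'family', 'skirmish']]

# Count how often each token appears across all plots
counts = Counter(token for plot in toy_token_plots for token in plot)

# Most frequent first, then a word -> position lookup
toy_ranked_words = [word for word, _ in counts.most_common()]
toy_rank = {word: position for position, word in enumerate(toy_ranked_words)}

print(toy_ranked_words[:2])  # ['war', 'family']
```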

Now we are ready to define a synonym remover. The idea is that, for a list of tokens, we go through each pair of tokens and use WordNet to find their 'similarity'. This gives us a numerical value of how close they are, and by setting a similarity threshold we can decide whether to treat them as synonyms of each other. Once we have classified two words as synonyms, we keep only the more relevant one. To define 'relevance', we use each word's index in our ranked words, keeping the one with the lower index (i.e. the more frequent word).

One issue to address is that WordNet has multiple definitions for each word, so we need to make sure we are using the most common one. For instance, consider the words 'drink' and 'swallow'. Their most common definitions are as verbs, and in this case WordNet gives them a path similarity of 0.33. However, if our function treats them as nouns, where the senses can be quite different (e.g. drink as a body of water and swallow as the bird), the similarity is lower: the first noun senses of each word give only 0.2.

In [54]:
drink_verb=wordnet.synset('drink.v.01')
swallow_verb=wordnet.synset('swallow.v.01')
drink_noun=wordnet.synset('drink.n.01')
swallow_noun=wordnet.synset('swallow.n.01')
print(drink_verb.path_similarity(swallow_verb),drink_noun.path_similarity(swallow_noun))


0.3333333333333333 0.2
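
To interpret these numbers: nltk's path similarity is 1 / (d + 1), where d is the length of the shortest path between the two senses in the WordNet hierarchy. The sketch below just evaluates that formula; the helper name is hypothetical.

```python
def path_similarity_from_distance(distance):
    # Path similarity as a function of the shortest path length
    return 1 / (distance + 1)

print(path_similarity_from_distance(2))  # 0.3333333333333333
print(path_similarity_from_distance(4))  # 0.2
```

So the verb senses are two steps apart, while the noun senses are four steps apart.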


There is no guaranteed way to say what type of word a token is without referring to its context within the original text. Instead we use the word's most common definition. One way of estimating this is to count up every possible definition of the word, then pick the word type with the most definitions. For instance, 'jump' has many slightly different definitions, but by looking at all of them we can see that it is by far most often defined as a verb, so we treat it as such.

In [355]:
wordnet.synsets('jump')

[Synset('jump.n.01'),
 Synset('leap.n.02'),
 Synset('jump.n.03'),
 Synset('startle.n.01'),
 Synset('jump.n.05'),
 Synset('jump.n.06'),
 Synset('jump.v.01'),
 Synset('startle.v.02'),
 Synset('jump.v.03'),
 Synset('jump.v.04'),
 Synset('leap_out.v.01'),
 Synset('jump.v.06'),
 Synset('rise.v.11'),
 Synset('jump.v.08'),
 Synset('derail.v.02'),
 Synset('chute.v.01'),
 Synset('jump.v.11'),
 Synset('jumpstart.v.01'),
 Synset('jump.v.13'),
 Synset('leap.v.02'),
 Synset('alternate.v.01')]
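
The count can be checked directly from the synset names listed above, since the middle part of each name is the word type ('n' for noun, 'v' for verb). The sketch below copies those names into a plain list so that it runs without WordNet.

```python
from collections import Counter

# Synset names copied from the output of wordnet.synsets('jump') above
jump_synsets = ['jump.n.01', 'leap.n.02', 'jump.n.03', 'startle.n.01',
                'jump.n.05', 'jump.n.06', 'jump.v.01', 'startle.v.02',
                'jump.v.03', 'jump.v.04', 'leap_out.v.01', 'jump.v.06',
                'rise.v.11', 'jump.v.08', 'derail.v.02', 'chute.v.01',
                'jump.v.11', 'jumpstart.v.01', 'jump.v.13', 'leap.v.02',
                'alternate.v.01']

# The word type is the second dot-separated field of each name
type_counts = Counter(name.split('.')[1] for name in jump_synsets)
print(type_counts)  # Counter({'v': 15, 'n': 6})
```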

In [None]:
#FINDING MOST COMMON DEFINITION
def mostcommon(word):
    #Count up the amounts of each type of word definition
    verbcount=len(wordnet.synsets(word,'v'))
    #('s' is WordNet's tag for satellite adjectives, not a second verb type)
    satcount=len(wordnet.synsets(word,'s'))
    nouncount=len(wordnet.synsets(word,'n'))
    adjcount=len(wordnet.synsets(word,'a'))
    adverbcount=len(wordnet.synsets(word,'r'))
    #Find maximum count
    wordtype=max(verbcount,nouncount,adjcount,satcount,adverbcount)
    #Return the most basic definition of the most popular type of word
    #(If the maximum is shared by multiple types, we return the first match
    #in this arbitrary order: noun, verb, adjective, satellite adjective, adverb)
    if wordtype==nouncount:
        return(wordnet.synsets(word,'n')[0])
    elif wordtype==verbcount:
        return(wordnet.synsets(word,'v')[0])
    elif wordtype==adjcount:
        return(wordnet.synsets(word,'a')[0])
    elif wordtype==satcount:
        return(wordnet.synsets(word,'s')[0])
    elif wordtype==adverbcount:
        return(wordnet.synsets(word,'r')[0])

Now that we have dealt with this problem, we are ready to define our synonym remover. Here we look at every unordered pair of tokens, find the most common definition of each, and use these to calculate the Wu-Palmer similarity. (We first check that both words have at least one definition in WordNet, as the similarity can only be calculated if they do; e.g. 'hogwarts' has no entry, so a pair such as 'hogwarts' and 'castle' is skipped.) After manually tweaking the threshold to see what gives the best result, we find that 0.65 works quite well.
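
For reference, the Wu-Palmer similarity is usually defined in terms of depths in the WordNet hierarchy, where lcs denotes the deepest common ancestor of the two senses. This is the textbook form; nltk's implementation may differ in small details such as how depth is counted.

```latex
\mathrm{wup}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}\big(\mathrm{lcs}(s_1, s_2)\big)}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}
```

The score is 1 when the two senses coincide and falls towards 0 as their common ancestor gets closer to the root.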

In [412]:
#DEFINING SYNONYM REMOVER
from colorama import Fore

def synonymremover(tokens,text):
    #Defining list of unnecessary synonyms
    synonyms=[]
    for i in range(len(tokens)):
        #Only look at j>i, so each unordered pair is checked once
        for j in range(i+1,len(tokens)):
            word1=tokens[i]
            word2=tokens[j]
            #Check that both words have a possible definition
            if wordnet.synsets(word1) and wordnet.synsets(word2):
                #Define words as their most common definition
                word_1=mostcommon(word1)
                word_2=mostcommon(word2)
                #Calculate similarity (can be None if no connection is found)
                similarity=word_1.wup_similarity(word_2)
                #If they are close enough, add less significant word to synonyms
                if similarity is not None and similarity>0.65 and similarity!=1:
                    #Calculate significance by getting their rank from the ranked words
                    sig1 = ranked_words.index(word1)
                    sig2 = ranked_words.index(word2)
                    #Add whichever word is lower down in the ranking to synonyms
                    if sig1<sig2:
                        if text:
                            print(Fore.BLUE+word1,Fore.RED+word2)
                        synonyms.append(word2)
                    else:
                        if text:
                            print(Fore.BLUE+word2,Fore.RED+word1)
                        synonyms.append(word1)
    #Remove all unnecessary synonyms, building a new list rather than
    #removing from the list being looped over (which can skip words)
    return([word for word in tokens if word not in synonyms])
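
The pairing and ranking logic can be tried out without WordNet. In the sketch below the similarity scores and ranks are made up, standing in for `wup_similarity` and `ranked_words`, and all names beginning with `toy_` are hypothetical. It uses `itertools.combinations` to visit each unordered pair once and a dict for rank lookups.

```python
from itertools import combinations

# Made-up similarity scores and ranks (assumptions for illustration only)
toy_similarity = {frozenset(('war', 'battle')): 0.8,
                  frozenset(('war', 'skirmish')): 0.7,
                  frozenset(('battle', 'skirmish')): 0.9}
toy_rank = {'war': 0, 'family': 1, 'battle': 2, 'skirmish': 3}

def toy_synonymremover(tokens, threshold=0.65):
    to_drop = set()
    for word1, word2 in combinations(tokens, 2):
        score = toy_similarity.get(frozenset((word1, word2)))
        if score is not None and threshold < score < 1:
            # Drop whichever word sits lower down the ranking
            to_drop.add(word1 if toy_rank[word1] > toy_rank[word2] else word2)
    # Build a new list instead of removing from the one being looped over
    return [word for word in tokens if word not in to_drop]

toy_result = toy_synonymremover(['battle', 'family', 'skirmish', 'war'])
print(toy_result)  # ['family', 'war']
```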

We can see how this works using the plot of the film 'Millions' as an example. Here if two words are declared as synonyms by our function, it prints the more relevant one we will keep in blue and the less relevant one we will omit in red.

In [419]:
synonymremover(tokenizer(plots[200]),True)

money currency
pound euro
load playhouse
film load
film playhouse


['uk',
 'switch',
 'pound',
 'giving',
 'gang',
 'chance',
 'rob',
 'secure',
 'train',
 'money',
 'way',
 'incineration',
 'robbery',
 'big',
 'falls',
 'sky',
 'year',
 'old',
 'given',
 'talking',
 'boy',
 'start',
 'seeing',
 'world',
 'make',
 'human',
 'soul',
 'come',
 'forefront',
 'film']

In [415]:
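#UPDATING TOKENIZED PLOTS
#Note: this is slow, since every pair of tokens in each plot is compared,
#so the update is left commented out and only tested on plot 200

#for i in range(100):
#    token_plots[i]=synonymremover(token_plots[i],False)
synonymremover(tokenizer(plots[200]),True)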

In [418]:
wordnet.synsets('hogwarts')

[]