# Data Cleaning

When we use a pre-trained word embeddings model with convolutional or recurrent neural networks, the data cleaning process is quite different from the one used to prepare data for traditional machine learning models.

When we are fitting an RNN we often want to keep the text in its original format because it contains important information. For example, we often don't want to get rid of stopwords because they might tell us something about the surrounding words, such as whether a word is a noun or a verb. Instead, the main goal of cleaning when using an RNN with pre-trained word embeddings is to make the vocabulary used in the dataset as similar as possible to the vocabulary of the embeddings model. This matters because the model cannot use a word unless it can be transformed into a vector. The specifics of how this is done depend on the embedding. We will be using the glove.840B.300d pre-trained vectors (https://nlp.stanford.edu/projects/glove/). The general approach is much the same for other word embeddings.

What we will do here is go through an iterative process. We start by looking at how much of the data can be represented by the pre-trained word vectors and identifying the most common tokens that can't be represented. We then write a function that cleans the data so that these tokens are converted to text that can be represented. We repeat this process, checking which tokens still can't be represented and tweaking our cleaning function, until we are happy that it cleans the data well enough. This function can then be used as part of our pipeline when modelling.

Note: tqdm is a nice little package that outputs a progress bar while a loop runs. For more information see https://pypi.org/project/tqdm/. In most cases I ended up not using it in this code because it doesn't display well in a Jupyter notebook, but when working on your own code it can be useful to see that your program is progressing.

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

In [None]:
train = pd.read_csv("../../data/labeledTrainData.tsv", sep='\t', encoding='utf-8')
print("Train shape : ",train.shape)

In [None]:
def build_vocab(sentences, verbose=False):
    """
    :param sentences: list of list of words
    :param verbose: currently unused
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            # Start the count at 0 the first time a word is seen.
            vocab[word] = vocab.get(word, 0) + 1
    return vocab

One of the first things we need is a dictionary that records, for each word that occurs, the number of occurrences. We can then compare it with the words in the embedding to see which of them exist there, and calculate statistics on the percentage of words that do and don't.
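The same word counts can also be produced with `collections.Counter` from the standard library. The sketch below is an alternative to `build_vocab`, not the function used in the rest of this notebook; the name `build_vocab_counter` and the sample sentences are made up for illustration.

```python
from collections import Counter
from itertools import chain

def build_vocab_counter(sentences):
    """Count how often each token occurs across a list of tokenised sentences."""
    return Counter(chain.from_iterable(sentences))

# Two made-up, already tokenised sentences.
sample_sentences = [["the", "film", "was", "great"], ["the", "plot", "was", "thin"]]
sample_vocab = build_vocab_counter(sample_sentences)
print(sample_vocab.most_common(2))
```

A `Counter` behaves like a dict, so it could be passed to code that expects the plain dictionary returned by `build_vocab`.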

In [None]:
sentences = train["review"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)

Here we load the GloVe embeddings file into a dict where the word is the key and the vector is the value.

In [None]:
EMBEDDING_FILE_GLOVE = '../../embeddings/glove.840B.300d/glove.840B.300d.txt'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
# Use a context manager so the file handle is closed after loading.
with open(EMBEDDING_FILE_GLOVE, encoding='latin') as f:
    embed_glove = dict(get_coefs(*o.split(" ")) for o in f)
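Each line of the GloVe text file holds a token followed by its vector components, separated by single spaces. The sketch below parses that layout on a tiny made-up sample, using plain Python lists instead of NumPy arrays. The function `parse_glove_lines` and the 3-dimensional sample are assumptions for illustration and are not part of this notebook's pipeline.

```python
import io

def parse_glove_lines(lines, dim):
    """Build {token: vector} from GloVe-style text lines.

    The last `dim` space-separated fields are taken as the vector and
    everything before them as the token, so a token that itself
    contains spaces is kept whole.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        token = " ".join(parts[:-dim])
        embeddings[token] = [float(x) for x in parts[-dim:]]
    return embeddings

# Made-up 3-dimensional sample standing in for the real 300-dimensional file.
sample_file = io.StringIO("the 0.1 0.2 0.3\n, 0.4 0.5 0.6\n. . . 0.7 0.8 0.9\n")
sample_embed = parse_glove_lines(sample_file, dim=3)
print(sorted(sample_embed))
```

Taking the vector from the end of the line is a defensive choice: it keeps working if a token happens to contain a space.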

Next the percentage of the vocabulary and of all the text are calculated so we can see how well the text in its current format is covered by the pre-trained word vectors.

In [None]:
import operator

def check_coverage(vocab, embed_glove):
    a = {}    # words that have a vector
    oov = {}  # out-of-vocabulary words and their counts
    k = 0     # number of tokens covered by the embedding
    i = 0     # number of tokens not covered
    for word in vocab:
        try:
            a[word] = embed_glove[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for {:.2%} of all text'.format(k / (k + i)))
    # Most frequent out-of-vocabulary words first.
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

oov = check_coverage(vocab,embed_glove)
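The two percentages measure different things: the first counts each distinct word once, while the second weights each word by how often it occurs. A small worked example with made-up counts shows how far apart they can be. The helper `coverage` and the toy dictionaries are hypothetical and only mirror the arithmetic of `check_coverage`.

```python
def coverage(vocab, embeddings):
    """Return (share of distinct words covered, share of all tokens covered)."""
    covered_words = [w for w in vocab if w in embeddings]
    covered_tokens = sum(vocab[w] for w in covered_words)
    total_tokens = sum(vocab.values())
    return len(covered_words) / len(vocab), covered_tokens / total_tokens

# Made-up counts: "great!" has no vector because of the attached punctuation.
toy_vocab = {"the": 6, "movie": 3, "great!": 1}
toy_embed = {"the": [0.1], "movie": [0.2], "great": [0.3]}

vocab_share, text_share = coverage(toy_vocab, toy_embed)
print(f"{vocab_share:.2%} of vocab, {text_share:.2%} of all text")
# prints "66.67% of vocab, 90.00% of all text"
```

Two of the three distinct words are covered, but they account for nine of the ten tokens, which is why rare uncovered words hurt the first figure much more than the second.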

If we look at the ten most common words that do not occur in the GloVe embedding, we should get some idea of how the data needs cleaning to fit the embeddings better and improve coverage.

In [None]:
print(oov[:10])

The most obvious thing here is that the data includes some HTML. We can get rid of that using the fantastic BeautifulSoup package. To do this we will create a clean_text function that is passed each review and returns a cleaned version of it. We can then run code similar to the above to see how well the pre-trained vectors cover the cleaned data.

In [None]:
from bs4 import BeautifulSoup

def clean_text(text):
    text = str(text)
    # Strip HTML tags and keep only the visible text.
    text = BeautifulSoup(text, 'html.parser').get_text()
    return text

def run_cleaning_output_summary(fun):
    # Clean every review, rebuild the vocabulary and report embedding coverage.
    cleaned = train["review"].apply(fun)
    sentences = cleaned.apply(lambda x: x.split())
    vocab = build_vocab(sentences)
    oov = check_coverage(vocab, embed_glove)
    print("Top 20 most common words not found in Glove embedding:")
    print(oov[:20])
    
run_cleaning_output_summary(clean_text)

After printing out the 20 most common words that aren't in the embedding, we can see that punctuation is causing a significant problem. First let us check whether the punctuation characters have corresponding GloVe vectors.

In [None]:
print("Check to see which punctuation if any have vectors:")
for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
    print(punct, " : ", punct in embed_glove)

Almost everything does. If we surround each punctuation character with spaces, it will be picked up as a separate token and converted into its corresponding vector. Let's try it and see what happens. To do this we modify the clean_text function and run the above code again to see if the coverage improves:

In [None]:
def clean_text(text):
    
    text = str(text)
    text = BeautifulSoup(text, 'html.parser').get_text()

    # Pad every punctuation character with spaces so it becomes its own token.
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~“”’':
        text = text.replace(punct, f' {punct} ')
    
    return text

run_cleaning_output_summary(clean_text)
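The character-by-character loop in `clean_text` can also be written as a single regular-expression substitution. This is a minimal sketch of that alternative, not the version used in this notebook; the names `PUNCT`, `PUNCT_RE` and `space_punctuation` are made up for illustration.

```python
import re

# The ASCII punctuation used above plus the curly quotes. It is an assumption
# that every one of these characters should become its own token.
PUNCT = '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~“”’'
PUNCT_RE = re.compile("([" + re.escape(PUNCT) + "])")

def space_punctuation(text):
    """Put a space on each side of every punctuation character in one pass."""
    return PUNCT_RE.sub(r" \1 ", text)

print(space_punctuation("Great movie, isn't it?").split())
# prints ['Great', 'movie', ',', 'isn', "'", 't', 'it', '?']
```

Note that the apostrophe in "isn't" is split off too, exactly as the loop does, so contractions end up as three tokens.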

We've gone from being able to map 89.17% of the words to vectors to 99.77% of the words, which is a great improvement, and we are getting to the point where cleaning the data further is not worth the effort. So let's try one last iteration of the data cleaning process.

The easiest words to convert here are foreign words containing accents: we simply replace the accented letters with unaccented ones. I also replaced \x96 with a - and ’ with a ', although it's not clear whether this is a good idea without seeing how it affects model accuracy.

In [None]:
def clean_text(text):
    
    text = str(text)
    text = BeautifulSoup(text, 'html.parser').get_text()

    # Pad every punctuation character with spaces so it becomes its own token.
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~“”’':
        text = text.replace(punct, f' {punct} ')
        
    # Replace accented letters. Each replacement also covers longer forms
    # that contain it, e.g. "clichés", "clichéd" and "fiancée".
    text = text.replace("cliché", 'cliche')
    text = text.replace("fiancé", 'fiance')
    text = text.replace("café", 'cafe')
    text = text.replace("matinée", 'matinee')
    text = text.replace("naïve", 'naive')
    text = text.replace("José", 'Jose')
    text = text.replace("risqué", 'risque')
    
    text = text.replace('\x96', '-')
    
    text = text.replace('’', "'")
    
    return text

run_cleaning_output_summary(clean_text)
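Listing accented words one by one only handles the words we happened to spot. A more general option is to decompose each character with the standard library's `unicodedata` module and drop the combining accent marks. This is a sketch of that alternative, and `strip_accents` is a hypothetical helper rather than part of the cleaning function above. Whether it helps would still need to be checked against the coverage figures and model accuracy, since it changes every accented character, including any that might already match the embedding.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, e.g. 'é' -> 'e', leaving other characters alone."""
    # NFD splits an accented letter into the base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("a naïve fiancée in a café"))
# prints "a naive fiancee in a cafe"
```

Characters such as ’ and \x96 are not accents, so they are left untouched and would still need the explicit replacements.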

We could go further, but now that we have found embeddings for 99.8% of the words we should question whether the extra effort is worth it.

We now have a function that can be used to clean the data ready to be used by our RNN.