Save Prepared Data

In [3]:
import string
import re
from os import listdir
from nltk.corpus import stopwords

We can use the data cleaning steps and the chosen vocabulary to prepare each movie review and save the prepared versions ready for modeling. This is good practice because it decouples data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.
We start by defining load_doc(), which we will use to load the vocabulary from vocab.txt.

In [4]:
#load doc into memory
def load_doc(filename):
    #open the file as read only; the with block closes it automatically
    with open(filename,'r') as file:
        #read all text
        text=file.read()
    return text

Next, we can clean the reviews, use the loaded vocabulary to filter out unwanted tokens, and save the clean reviews in a file. One approach is to save all the positive reviews in one file and all the negative reviews in another, with each review on its own line and its filtered tokens separated by white space.
First, we can define a function to process a document, clean it, filter it, and return it as a single line that can be saved in a file. The cells below define the doc_to_line() function to do just that, taking a filename and a vocabulary (as a set) as arguments. It calls the previously defined load_doc() function to load the document and clean_doc() to tokenize it.


In [5]:
#turn a doc into clean tokens
def clean_doc(doc):
    #split into tokens by white space
    tokens=doc.split()
    #prepare a regex for char filtering
    re_punc=re.compile('[%s]' %re.escape(string.punctuation))
    #remove punctuation from each word
    tokens=[re_punc.sub('',w) for w in tokens]
    #remove remaining tokens that are not alphabetic
    tokens=[word for word in tokens if word.isalpha()]
    #filter out stopwords
    stop_words=set(stopwords.words('english'))
    tokens=[w for w in tokens if w not in stop_words]
    #filter out short tokens
    tokens=[word for word in tokens if len(word)>1]
    return tokens    
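To make the punctuation and alphabetic filters concrete, here is a minimal sketch on a made-up sentence. It repeats only the first two filtering steps of clean_doc() and leaves out the NLTK stop-word step so that it runs without any downloads; the sample text is an assumption for illustration, not taken from the dataset.

```python
import re
import string

# same regex as in clean_doc(): matches any single punctuation character
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

# hypothetical sample text, not from the review dataset
raw = "The film's plot is well-made, 10/10 !"

# split on white space, then strip punctuation from each token
tokens = [re_punc.sub('', w) for w in raw.split()]
print(tokens)  # ['The', 'films', 'plot', 'is', 'wellmade', '1010', '']

# drop tokens that are empty or contain non-alphabetic characters
tokens = [w for w in tokens if w.isalpha()]
print(tokens)  # ['The', 'films', 'plot', 'is', 'wellmade']
```

Note that stripping punctuation joins the parts of a hyphenated word and leaves an empty string where a token was pure punctuation; the isalpha() check then removes the empty string and the digits.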

In [6]:
#save list to file
def save_list(lines,filename):
    #join the lines into one string, one review per line
    data='\n'.join(lines)
    #the with block closes the file automatically
    with open(filename,'w') as file:
        file.write(data)

In [7]:
#load doc,clean and return lines of tokens
def doc_to_line(filename,vocab):
    doc=load_doc(filename)
    tokens=clean_doc(doc)
    #filter by vocab
    tokens=[w for w in tokens if w in vocab]
    return ' '.join(tokens)
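The vocabulary filter inside doc_to_line() can be illustrated on its own. The sketch below assumes a tiny hypothetical vocabulary and a hand-written token list standing in for the output of clean_doc(); the real vocabulary comes from vocab.txt.

```python
# hypothetical toy vocabulary, standing in for the set loaded from vocab.txt
toy_vocab = {'film', 'great', 'plot'}

# tokens as clean_doc() might return them for a short review (made up)
toy_tokens = ['great', 'film', 'weak', 'plot', 'overall']

# keep only tokens that appear in the vocabulary, preserving their order
kept = [w for w in toy_tokens if w in toy_vocab]

# join the surviving tokens into the single line that would be saved
line = ' '.join(kept)
print(line)  # great film plot
```

Keeping the vocabulary as a set rather than a list makes each membership test fast, which matters when it is repeated for every token of every review.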

Next, we can define a new version of process_docs() to step through all reviews in a folder and convert them to lines by calling doc_to_line() for each document. A list of lines is then returned.


In [8]:
#load docs in a directory
def process_docs(directory,vocab):
    lines=list()
    #walk through all files in the folder
    for filename in listdir(directory):
        #create the full path of the file to open
        path=directory+'/'+filename
        #load and clean the doc
        line=doc_to_line(path,vocab)
        #add to list
        lines.append(line)
    return lines
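One thing to be aware of is that listdir() returns filenames in arbitrary order, so the order of lines in the saved files may differ between machines. If a reproducible order matters, a variation along the following lines could be used. This is only a sketch: the helper name list_review_paths() is hypothetical, and the assumption that review files end in .txt is ours rather than something the code above checks.

```python
from os import listdir
from os.path import isfile, join

def list_review_paths(directory):
    """Return sorted paths of the .txt files directly inside directory."""
    paths = []
    # sorted() gives a stable, reproducible order across runs and platforms
    for filename in sorted(listdir(directory)):
        path = join(directory, filename)
        # assumption: reviews are regular files with a .txt extension
        if isfile(path) and filename.endswith('.txt'):
            paths.append(path)
    return paths
```

Each returned path could then be passed to doc_to_line() exactly as process_docs() does.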

In [9]:
vocab_filename='vocab.txt'
vocab=load_doc(vocab_filename)
vocab=vocab.split()
vocab=set(vocab)

In [10]:
#prepare negative reviews
negative_lines=process_docs('data/review_polarity/txt_sentoken/neg',vocab)
save_list(negative_lines,'negative.txt')

In [12]:
#prepare positive reviews
positive_lines=process_docs('data/review_polarity/txt_sentoken/pos',vocab)
save_list(positive_lines,'positive.txt')

Running the example saves two new files, negative.txt and positive.txt, that contain the prepared negative and positive reviews respectively.
The data is ready for use in a bag-of-words or even word embedding model.
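As a rough sketch of how the saved files might be read back when modeling starts, the function below loads both files and pairs each review with a class label. The function name load_prepared() and the labeling convention (0 for negative, 1 for positive) are assumptions made for illustration.

```python
def load_prepared(negative_file, positive_file):
    """Load prepared reviews and return (docs, labels)."""
    docs, labels = [], []
    # assumed convention: label 0 for negative reviews, 1 for positive
    for filename, label in ((negative_file, 0), (positive_file, 1)):
        with open(filename, 'r') as file:
            # one review per line, as written by save_list()
            lines = file.read().splitlines()
        docs.extend(lines)
        labels.extend([label] * len(lines))
    return docs, labels
```

With the files produced above, this could be called as docs, labels = load_prepared('negative.txt', 'positive.txt'), after which each entry in docs can be split on white space to recover its tokens.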