# Exploring text preprocessing with the IMDB dataset
IMDB dataset: http://ai.stanford.edu/~amaas/data/sentiment/

### Reading the Data

In [1]:
txt_path = 'data/aclImdb/train/neg/10_2.txt'
# Use a context manager so the file is closed after reading
with open(txt_path) as f:
    txt = f.read()

In [2]:
txt

"This film had a lot of promise, and the plot was relatively interesting, however the actors, director and editors seriously let this film down.<br /><br />I feel bad for the writers, it could have been good. The acting is wooden, very few of the characters are believable.<br /><br />Who ever edited this clearly just learnt some new edit techniques and wanted to splash them all over the film. There are lots of quick 'flashy' edits in almost every scene, which are clearly meant to be symbolic but just end up as annoying.<br /><br />I wanted to like this film and expected there to be a decent resolution to the breakdown of equilibrium but alas no, it left me feeling like I'd wasted my time and the film makers had wasted their money."

### 1. Removing HTML Markup:
We use the BeautifulSoup package.  
Calling get_text() returns the text of the review without tags or markup.

In [4]:
from bs4 import BeautifulSoup

In [6]:
example1 = BeautifulSoup(txt, 'lxml')

print(example1.get_text())

This film had a lot of promise, and the plot was relatively interesting, however the actors, director and editors seriously let this film down.I feel bad for the writers, it could have been good. The acting is wooden, very few of the characters are believable.Who ever edited this clearly just learnt some new edit techniques and wanted to splash them all over the film. There are lots of quick 'flashy' edits in almost every scene, which are clearly meant to be symbolic but just end up as annoying.I wanted to like this film and expected there to be a decent resolution to the breakdown of equilibrium but alas no, it left me feeling like I'd wasted my time and the film makers had wasted their money.
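
Note that the output above runs sentences together ("down.I feel") because the `<br />` tags are dropped without leaving any whitespace, which can confuse a sentence splitter later on. One possible workaround, sketched here with the standard library only, is to replace each run of `<br>` tags with a single space before stripping the remaining markup. The short `raw` string is an illustrative excerpt, not the full review; BeautifulSoup's `get_text()` also accepts a `separator` argument that serves a similar purpose.

```python
import re

# Illustrative excerpt of the raw review text (assumption: only <br> tags matter here).
raw = "let this film down.<br /><br />I feel bad for the writers."

# Replace each run of <br>, <br/> or <br /> tags with one space
# so that sentence boundaries survive markup removal.
spaced = re.sub(r"(?:<br\s*/?>\s*)+", " ", raw, flags=re.IGNORECASE)

print(spaced)
```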


### 2. Split the paragraph into sentences
Optional. This step might be helpful for a word2vec model, but for doc2vec we treat the whole document as a single unit.

In [16]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [18]:
# get_text() already returns a unicode string, so no decode() call is needed
sentences = tokenizer.tokenize(example1.get_text().strip())

In [19]:
sentences

[u'This film had a lot of promise, and the plot was relatively interesting, however the actors, director and editors seriously let this film down.I feel bad for the writers, it could have been good.',
 u'The acting is wooden, very few of the characters are believable.Who ever edited this clearly just learnt some new edit techniques and wanted to splash them all over the film.',
 u"There are lots of quick 'flashy' edits in almost every scene, which are clearly meant to be symbolic but just end up as annoying.I wanted to like this film and expected there to be a decent resolution to the breakdown of equilibrium but alas no, it left me feeling like I'd wasted my time and the film makers had wasted their money."]

### 3. Keep only English letters
Remove all punctuation and numbers by replacing every non-letter character with a space.  
However, some models define the context within a single sentence, so punctuation should be removed only after the text has been split into sentences.

In [7]:
import re

In [8]:
letters_only = re.sub("[^a-zA-Z]", " ", example1.get_text())

In [9]:
letters_only

u'This film had a lot of promise  and the plot was relatively interesting  however the actors  director and editors seriously let this film down I feel bad for the writers  it could have been good  The acting is wooden  very few of the characters are believable Who ever edited this clearly just learnt some new edit techniques and wanted to splash them all over the film  There are lots of quick  flashy  edits in almost every scene  which are clearly meant to be symbolic but just end up as annoying I wanted to like this film and expected there to be a decent resolution to the breakdown of equilibrium but alas no  it left me feeling like I d wasted my time and the film makers had wasted their money '
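
When sentence boundaries matter, the same substitution can be applied to each sentence after splitting rather than to the whole review. The following is a minimal sketch of that ordering; the two sentences and the helper name `sentence_to_words` are illustrative assumptions, not part of the notebook.

```python
import re

# Two sentences, assumed to have been split already (for example by a punkt tokenizer).
sentences = [
    "The acting is wooden, very few of the characters are believable.",
    "I wanted to like this film, but alas no.",
]

def sentence_to_words(sentence):
    """Replace every non-letter with a space, lowercase, and split into words."""
    return re.sub("[^a-zA-Z]", " ", sentence).lower().split()

# One list of words per sentence, so punctuation is removed
# only after the sentence boundaries have been used.
tokenized = [sentence_to_words(s) for s in sentences]

print(tokenized)
```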

### 4. Lowercase the text and split it into words

In [13]:
# split() with no arguments already discards all surrounding whitespace,
# so no extra strip() pass is needed
words = letters_only.lower().split()

### 5. Remove stopwords
Optional. Removing stopwords might help a bag-of-words model, but it might hurt word2vec and doc2vec models.

In [10]:
from nltk.corpus import stopwords

In [14]:
print(len(words))
# Build the stopword set once; calling stopwords.words() for every word is slow
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]
print(len(words))

130
62


In [15]:
' '.join(words)

u'film lot promise plot relatively interesting however actors director editors seriously let film feel bad writers could good acting wooden characters believable ever edited clearly learnt new edit techniques wanted splash film lots quick flashy edits almost every scene clearly meant symbolic end annoying wanted like film expected decent resolution breakdown equilibrium alas left feeling like wasted time film makers wasted money'
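
The filtering step itself is just a membership test against a collection of stopwords. Since set lookups take constant time on average, it helps to build the collection once as a set. Here is a small sketch of the idea; the `STOPWORDS` set is a tiny hand-picked assumption for illustration and is much shorter than NLTK's English list.

```python
# A tiny hand-picked stopword set for illustration only (not NLTK's list).
STOPWORDS = frozenset(["this", "had", "a", "of", "and", "the", "was"])

words = "this film had a lot of promise and the plot was relatively interesting".split()

# Keep only the words that are not in the stopword set.
kept = [w for w in words if w not in STOPWORDS]

print(len(words), len(kept))
print(" ".join(kept))
```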

### 6. Porter Stemming and Lemmatizing
**Porter stemming and lemmatizing (both available in NLTK)** allow us to treat "messages", "message", and "messaging" as the same word, which could certainly be useful.  
**Stemming** is a little bit tricky, so we only test **lemmatizing** here.

In [20]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [21]:
clean_words = [wordnet_lemmatizer.lemmatize(w) for w in words]

In [23]:
' '.join(clean_words)

u'film lot promise plot relatively interesting however actor director editor seriously let film feel bad writer could good acting wooden character believable ever edited clearly learnt new edit technique wanted splash film lot quick flashy edits almost every scene clearly meant symbolic end annoying wanted like film expected decent resolution breakdown equilibrium ala left feeling like wasted time film maker wasted money'
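
For comparison with the lemmatizer output above, here is a minimal sketch of Porter stemming with NLTK's `PorterStemmer`, applied to the three example words mentioned earlier. The word list is illustrative. Stems are often not dictionary words, which is one reason stemming can be tricky to work with.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The three forms mentioned earlier; the stemmer needs no corpus download.
forms = ["messages", "message", "messaging"]
stems = [stemmer.stem(w) for w in forms]

# All three forms should collapse to a single stem.
print(stems)
print(len(set(stems)) == 1)
```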

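#### The lemmatizer depends on the part of speech (POS), so tagging the words first should give better lemmas.  
WordNet POS constants: ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'. Without a POS argument, lemmatize() treats every word as a noun.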

In [27]:
%time word_pos = nltk.pos_tag(words)

CPU times: user 148 ms, sys: 8 ms, total: 156 ms
Wall time: 344 ms


In [36]:
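def pos_tag_map(tag):
    """Map a Penn Treebank tag to the WordNet POS letter the lemmatizer expects."""
    if tag.startswith('J'):      # adjectives (JJ, JJR, JJS)
        return 'a'
    elif tag.startswith('V'):    # verbs (VB, VBD, VBG, ...)
        return 'v'
    elif tag.startswith('N'):    # nouns (NN, NNS, NNP, ...)
        return 'n'
    elif tag.startswith('R'):    # mainly adverbs (RB, RBR, RBS)
        return 'r'
    else:
        return 'n'               # fall back to noun, the lemmatizer's default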
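
The mapping above can be checked on a few Penn Treebank tags without loading any NLTK data. The sketch below uses an equivalent dictionary-based variant; the name `penn_to_wordnet` is hypothetical and chosen only so that it does not shadow the notebook's function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    prefix_to_pos = {"J": "a", "V": "v", "N": "n", "R": "r"}
    # Look at the first character only; unknown or empty tags fall back to 'n'.
    return prefix_to_pos.get(tag[:1], "n")

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```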

In [38]:
%time clean_words = [wordnet_lemmatizer.lemmatize(w, pos_tag_map(pos)) for (w, pos) in word_pos]

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.38 ms


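### Well, not bad :/
The result is still imperfect, because the POS tag is sometimes ambiguous or wrong. For example, "wasted" becomes "waste" in one place but stays "wasted" in another, and "alas" is still turned into "ala".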

In [39]:
' '.join(clean_words)

u'film lot promise plot relatively interesting however actor director editor seriously let film feel bad writer could good act wooden character believable ever edit clearly learnt new edit technique want splash film lot quick flashy edits almost every scene clearly mean symbolic end annoy want like film expect decent resolution breakdown equilibrium ala leave feel like wasted time film maker waste money'

In [43]:
' '.join(words)

u'film lot promise plot relatively interesting however actors director editors seriously let film feel bad writers could good acting wooden characters believable ever edited clearly learnt new edit techniques wanted splash film lots quick flashy edits almost every scene clearly meant symbolic end annoying wanted like film expected decent resolution breakdown equilibrium alas left feeling like wasted time film makers wasted money'