In [25]:
import string
import pandas as pd

## Removing Punctuation

Let's first review the available punctuation characters.

In [26]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Now, let's create a function to remove punctuation.

In [27]:
def remove_punctuation(text):
    # Keep every character that is not listed in string.punctuation.
    text = "".join([char for char in text if char not in string.punctuation])

    return text
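
For longer texts, the same result can be obtained with `str.translate`, which avoids a character-by-character Python loop. The block below is a minimal sketch and not part of the original notebook; the names `_PUNCT_TABLE` and `remove_punctuation_fast` are hypothetical.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Hypothetical alternative to remove_punctuation based on str.translate."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_fast("Hello, my name is Brian!"))
# Hello my name is Brian
```

Both versions delete punctuation outright, so either one could be passed to `apply` later on.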

Let's try this function out on a single piece of text. First, we'll import the CSV file provided and explore it briefly.

In [28]:
df = pd.read_csv('data/imdb_movie_reviews.csv')
df.head()

Unnamed: 0,label,review
0,negative,"In the ten years since Wildside aired, nothing..."
1,positive,This is a better-than-average entry in the Sai...
2,negative,"""The Mayor Of Hell"" has the feel of an early D..."
3,positive,This is a really great short from Hal Roach. T...
4,positive,A rather charming depiction of European union ...


In [29]:
df['label'].value_counts()

negative    25000
positive    25000
Name: label, dtype: int64

Grab a single review.

In [30]:
text = df['review'][0]
print(text)

In the ten years since Wildside aired, nothing has really come close to its quality in local production. This includes the two series of the enjoyable but overrated Underbelly, which have brought to life events in the recent criminal history of both Sydney and Melbourne. The miniseries Blue Murder (which also starred Tony Martin, but as someone on the other side of the law) may be the exception.<br /><br />Wildside is currently being repeated late at night on the ABC. Having not watched the show in quite a while, I'm still impressed by its uncompromising story lines and very human characters. The cast is excellent: Tony Martin as a detective haunted by the disappearance of his son, Rachael Blake (who later hooked up with Martin in real life) as a community worker struggling with alcoholism, and Alex Dimitriades as a young cop whose vice is gambling. Equally good support roles are provided by Aaron Pederson, Jessica Napier, Mary Coustas (yes, Effie herself), and a young Abbie Cornish.<b

Apply our function to remove punctuation.

In [31]:
text_no_punct = remove_punctuation(text)
print(text_no_punct)

In the ten years since Wildside aired nothing has really come close to its quality in local production This includes the two series of the enjoyable but overrated Underbelly which have brought to life events in the recent criminal history of both Sydney and Melbourne The miniseries Blue Murder which also starred Tony Martin but as someone on the other side of the law may be the exceptionbr br Wildside is currently being repeated late at night on the ABC Having not watched the show in quite a while Im still impressed by its uncompromising story lines and very human characters The cast is excellent Tony Martin as a detective haunted by the disappearance of his son Rachael Blake who later hooked up with Martin in real life as a community worker struggling with alcoholism and Alex Dimitriades as a young cop whose vice is gambling Equally good support roles are provided by Aaron Pederson Jessica Napier Mary Coustas yes Effie herself and a young Abbie Cornishbr br The ABC inexplicably releas
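
The output above contains stray fragments such as `exceptionbr` and `br`. They come from the `<br />` tags in the raw review: removing punctuation strips the angle brackets and slash but leaves the letters behind. One possible remedy, sketched here under the assumption that line-break tags are the only markup worth handling, is to replace those tags with a space before removing punctuation. The helper name `strip_br_tags` is hypothetical.

```python
import re
import string

def strip_br_tags(text):
    """Replace HTML line-break tags such as <br> or <br /> with a space."""
    return re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)

sample = "may be the exception.<br /><br />Wildside is currently being repeated"

# Strip the tags first, then remove punctuation as before.
cleaned = strip_br_tags(sample)
no_punct = "".join(ch for ch in cleaned if ch not in string.punctuation)

print(no_punct.split())
```

With the tags replaced first, `exception` and `Wildside` remain separate words and no `br` token is produced.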

Now, let's apply this to each review in the dataframe.

In [32]:
df['review_no_punct'] = df['review'].apply(remove_punctuation)

In [33]:
df.head()

Unnamed: 0,label,review,review_no_punct
0,negative,"In the ten years since Wildside aired, nothing...",In the ten years since Wildside aired nothing ...
1,positive,This is a better-than-average entry in the Sai...,This is a betterthanaverage entry in the Saint...
2,negative,"""The Mayor Of Hell"" has the feel of an early D...",The Mayor Of Hell has the feel of an early Dea...
3,positive,This is a really great short from Hal Roach. T...,This is a really great short from Hal Roach Th...
4,positive,A rather charming depiction of European union ...,A rather charming depiction of European union ...
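
The second row shows a side effect of deleting punctuation: `better-than-average` collapses into the single word `betterthanaverage`. If that is undesirable, one alternative is to replace each punctuation character with a space and then collapse the extra whitespace. This is a sketch of that variant, not the notebook's approach; `punctuation_to_spaces` is a hypothetical name.

```python
import string

# Map each punctuation character to a space instead of deleting it.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

print(punctuation_to_spaces("This is a better-than-average entry"))
# This is a better than average entry
```

The trade-off is that contractions are split as well: `I'm` becomes `I m` rather than `Im`.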


## Tokenization

Next, we'll set up a function that first lowercases the text and then splits it into individual words, giving us a list of strings. Note that `split()` with no arguments splits on any run of whitespace (spaces, tabs, and newlines), not just single spaces.

In [39]:
text = "Hello, my name is Brian!"
text = remove_punctuation(text)
text = text.lower()
text.split()

['hello', 'my', 'name', 'is', 'brian']

In [40]:
def tokenize(text):
    # Lowercase the text, then split it on runs of whitespace.
    tokens = text.lower().split()

    return tokens
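
The choice of `split()` without an argument matters. As a small illustration using a made-up string, compare it with `split(" ")`, which splits on every individual space:

```python
text = "hello   my\tname\nis brian"

# No argument: split on any run of whitespace and drop empty strings.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['hello', 'my', 'name', 'is', 'brian']

# Explicit " ": every single space is a separator, so repeated spaces
# produce empty strings and tabs/newlines stay inside the tokens.
space_tokens = text.split(" ")
print(space_tokens)
# ['hello', '', '', 'my\tname\nis', 'brian']
```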

Let's run our punctuation-free text through the new tokenization function.

In [41]:
text_tokens = tokenize(text_no_punct)
print(text_tokens)

['in', 'the', 'ten', 'years', 'since', 'wildside', 'aired', 'nothing', 'has', 'really', 'come', 'close', 'to', 'its', 'quality', 'in', 'local', 'production', 'this', 'includes', 'the', 'two', 'series', 'of', 'the', 'enjoyable', 'but', 'overrated', 'underbelly', 'which', 'have', 'brought', 'to', 'life', 'events', 'in', 'the', 'recent', 'criminal', 'history', 'of', 'both', 'sydney', 'and', 'melbourne', 'the', 'miniseries', 'blue', 'murder', 'which', 'also', 'starred', 'tony', 'martin', 'but', 'as', 'someone', 'on', 'the', 'other', 'side', 'of', 'the', 'law', 'may', 'be', 'the', 'exceptionbr', 'br', 'wildside', 'is', 'currently', 'being', 'repeated', 'late', 'at', 'night', 'on', 'the', 'abc', 'having', 'not', 'watched', 'the', 'show', 'in', 'quite', 'a', 'while', 'im', 'still', 'impressed', 'by', 'its', 'uncompromising', 'story', 'lines', 'and', 'very', 'human', 'characters', 'the', 'cast', 'is', 'excellent', 'tony', 'martin', 'as', 'a', 'detective', 'haunted', 'by', 'the', 'disappearance

Now, let's apply that to each review (with punctuation removed).

In [42]:
df['tokens'] = df['review_no_punct'].apply(tokenize)

In [43]:
df.head()

Unnamed: 0,label,review,review_no_punct,tokens
0,negative,"In the ten years since Wildside aired, nothing...",In the ten years since Wildside aired nothing ...,"[in, the, ten, years, since, wildside, aired, ..."
1,positive,This is a better-than-average entry in the Sai...,This is a betterthanaverage entry in the Saint...,"[this, is, a, betterthanaverage, entry, in, th..."
2,negative,"""The Mayor Of Hell"" has the feel of an early D...",The Mayor Of Hell has the feel of an early Dea...,"[the, mayor, of, hell, has, the, feel, of, an,..."
3,positive,This is a really great short from Hal Roach. T...,This is a really great short from Hal Roach Th...,"[this, is, a, really, great, short, from, hal,..."
4,positive,A rather charming depiction of European union ...,A rather charming depiction of European union ...,"[a, rather, charming, depiction, of, european,..."


## Removing Stop Words

First, import the stopwords corpus from NLTK.

In [44]:
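from nltk.corpus import stopwords

# If the corpus is not installed yet, run nltk.download('stopwords') once first.
# A set makes the membership test in remove_stopwords faster than a list.
stop_words = set(stopwords.words('english'))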

Make a function to remove stop words:

In [46]:
def remove_stopwords(tokens):
    # Keep only the tokens that are not stop words.
    tokens = [word for word in tokens if word not in stop_words]

    return tokens
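
One caveat for sentiment data such as movie reviews: NLTK's English list includes negations such as `not` and `no`, so removing every stop word can flip the apparent meaning of a phrase. The sketch below shows one way to keep selected words. It uses a small stand-in stop-word set so that it runs without NLTK, and the names `demo_stop_words`, `negations`, and `remove_stopwords_keep_negations` are all hypothetical.

```python
# Small stand-in stop-word set for illustration only; the notebook uses NLTK's list.
demo_stop_words = {"the", "is", "was", "a", "not", "no", "this"}

# Hypothetical set of negations to preserve.
negations = {"not", "no"}

def remove_stopwords_keep_negations(tokens, stop_words, keep):
    """Drop stop words, except those listed in `keep`."""
    return [word for word in tokens if word not in stop_words or word in keep]

tokens = ["this", "movie", "was", "not", "good"]
print(remove_stopwords_keep_negations(tokens, demo_stop_words, negations))
# ['movie', 'not', 'good']
```

Whether to keep negations depends on the downstream model, so treat this as an option to evaluate rather than a required step.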

Try that on the tokens from our sample review:

In [48]:
text_tokens_nostop = remove_stopwords(text_tokens)
print(text_tokens)
print(text_tokens_nostop)

['in', 'the', 'ten', 'years', 'since', 'wildside', 'aired', 'nothing', 'has', 'really', 'come', 'close', 'to', 'its', 'quality', 'in', 'local', 'production', 'this', 'includes', 'the', 'two', 'series', 'of', 'the', 'enjoyable', 'but', 'overrated', 'underbelly', 'which', 'have', 'brought', 'to', 'life', 'events', 'in', 'the', 'recent', 'criminal', 'history', 'of', 'both', 'sydney', 'and', 'melbourne', 'the', 'miniseries', 'blue', 'murder', 'which', 'also', 'starred', 'tony', 'martin', 'but', 'as', 'someone', 'on', 'the', 'other', 'side', 'of', 'the', 'law', 'may', 'be', 'the', 'exceptionbr', 'br', 'wildside', 'is', 'currently', 'being', 'repeated', 'late', 'at', 'night', 'on', 'the', 'abc', 'having', 'not', 'watched', 'the', 'show', 'in', 'quite', 'a', 'while', 'im', 'still', 'impressed', 'by', 'its', 'uncompromising', 'story', 'lines', 'and', 'very', 'human', 'characters', 'the', 'cast', 'is', 'excellent', 'tony', 'martin', 'as', 'a', 'detective', 'haunted', 'by', 'the', 'disappearance

Now, let's apply that to each review (after tokenization with punctuation removed).

In [49]:
df['tokens_nostop'] = df['tokens'].apply(remove_stopwords)

In [50]:
df.head()

Unnamed: 0,label,review,review_no_punct,tokens,tokens_nostop
0,negative,"In the ten years since Wildside aired, nothing...",In the ten years since Wildside aired nothing ...,"[in, the, ten, years, since, wildside, aired, ...","[ten, years, since, wildside, aired, nothing, ..."
1,positive,This is a better-than-average entry in the Sai...,This is a betterthanaverage entry in the Saint...,"[this, is, a, betterthanaverage, entry, in, th...","[betterthanaverage, entry, saint, series, hold..."
2,negative,"""The Mayor Of Hell"" has the feel of an early D...",The Mayor Of Hell has the feel of an early Dea...,"[the, mayor, of, hell, has, the, feel, of, an,...","[mayor, hell, feel, early, dead, end, kids, fi..."
3,positive,This is a really great short from Hal Roach. T...,This is a really great short from Hal Roach Th...,"[this, is, a, really, great, short, from, hal,...","[really, great, short, hal, roach, two, main, ..."
4,positive,A rather charming depiction of European union ...,A rather charming depiction of European union ...,"[a, rather, charming, depiction, of, european,...","[rather, charming, depiction, european, union,..."
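
The three steps above (punctuation removal, tokenization, stop-word removal) can also be chained into a single function when the intermediate columns are not needed. This is a minimal sketch with a hypothetical `preprocess` helper and a tiny stand-in stop-word set so that the example runs on its own.

```python
import string

# Translation table that deletes every punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text, stop_words):
    """Remove punctuation, lowercase, split on whitespace, and drop stop words."""
    tokens = text.translate(_PUNCT_TABLE).lower().split()
    return [word for word in tokens if word not in stop_words]

# Stand-in stop words for illustration; substitute the NLTK list in practice.
demo_stop_words = {"my", "is"}
print(preprocess("Hello, my name is Brian!", demo_stop_words))
# ['hello', 'name', 'brian']
```

Applied to the dataframe, something like `df['review'].apply(preprocess, stop_words=stop_words)` should yield the same tokens as the `tokens_nostop` column.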
