# Tokenizing Data
To get the data ready for machine learning, we need to tokenize each line and filter out tokens that occur only once before we can vectorize the data and fit a model.


In [15]:
import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize
import pandas as pd

data = pd.read_csv('data/moviedata.csv')

In [16]:
data.head()

Unnamed: 0,movie,character_name,line_num,line
0,American Psycho,Bateman,0,"we're sitting in pastels, this nouvelle northe..."
1,American Psycho,Bateman,1,you'll notice that my friends and i all look a...
2,American Psycho,Bateman,2,or can it be worn with a suit?
3,American Psycho,Bateman,3,with discreet pinstripes you should wear a sub...
4,American Psycho,Bateman,4,van patten looks puffy. has he stopped working...


## Tokenizing 
Our first step is to tokenize. We will include unigrams *and* bigrams in our set.

In [17]:
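data2 = data.dropna().copy(deep=True)

# Tokenize each line once and reuse the result for both n-gram sizes
unigrams = data2['line'].apply(word_tokenize)

# Join each adjacent pair of tokens with '_'. Calling nltk.bigrams here
# (rather than the bare imported name) keeps this cell re-runnable, since
# the result is stored in a variable that is also called `bigrams`.
bigrams = unigrams.apply(
    lambda tokens: ['_'.join(pair) for pair in nltk.bigrams(tokens)])

# Element-wise list concatenation: unigrams first, then bigrams
data2['tokens'] = unigrams + bigrams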

In [18]:
data2['tokens'][2]

['or',
 'can',
 'it',
 'be',
 'worn',
 'with',
 'a',
 'suit',
 '?',
 'or_can',
 'can_it',
 'it_be',
 'be_worn',
 'worn_with',
 'with_a',
 'a_suit',
 'suit_?']
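
To see the bigram step in isolation, here is a minimal sketch in plain Python. It assumes the line has already been tokenized and uses `zip` as a stand-in for `nltk.bigrams`; the helper name `join_bigrams` is hypothetical and not part of the notebook.

```python
def join_bigrams(tokens):
    """Pair each token with its successor and join the pair with '_'."""
    # zip stops at the shorter input, so a single token yields no bigrams
    return ['_'.join(pair) for pair in zip(tokens, tokens[1:])]

# Tokens copied from the output above
tokens = ['or', 'can', 'it', 'be', 'worn', 'with', 'a', 'suit', '?']

# Unigrams first, then the joined bigrams, as in the `tokens` column
combined = tokens + join_bigrams(tokens)
print(combined[-3:])  # ['with_a', 'a_suit', 'suit_?']
```

A line with n tokens produces n - 1 bigrams, so the nine tokens here give a combined list of 17 entries.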

## Removing 1-Count Occurrences
Before including the tokens in the final data set, we need to filter out unigrams that occur only once in the whole corpus. Bigrams are kept as they are.
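
The idea can be sketched on a toy corpus, using `collections.Counter` in place of `nltk.FreqDist` (which is a `Counter` subclass). The token lists below are made up for illustration and are not taken from the data set.

```python
from collections import Counter

# Hypothetical tokenized lines, for illustration only
token_lists = [
    ['a', 'suit', '?'],
    ['a', 'tie', '?'],
    ['pinstripes'],
]

# Count every word across the whole corpus, not per line
counts = Counter(word for tokens in token_lists for word in tokens)

# A set of the words seen more than once gives fast membership checks
keep = {word for word, n in counts.items() if n > 1}

# Drop the 1-count words from each line while preserving word order
filtered = [[w for w in tokens if w in keep] for tokens in token_lists]
print(filtered)  # [['a', '?'], ['a', '?'], []]
```

Note that a line made up entirely of 1-count words ends up as an empty list, which may need handling before vectorizing.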

In [19]:
# The result of the code below is saved in
# data/moviedata_tokens.csv, so you don't have
# to run it. If you want to run it anyway,
# remove the triple quotes around the code.
'''
# flatten the list of unigram tokens into a single list of words
words = [word for token_list in unigrams for word in token_list]

# create frequency distribution of the words
freq_dist = nltk.FreqDist(words)

# Keep only words that occur more than once; a set makes the
# membership checks below much faster than a list
uni_filtered_words = {word for word, count in freq_dist.items() if count > 1}

# Create unigrams column
data2['unigram_tokens'] = data2['tokens'].apply(
    lambda x: [word for word in x if word in uni_filtered_words])

# Combine filtered unigrams with all bigrams
bi_words = {word for token_list in bigrams for word in token_list}
filtered_words = uni_filtered_words | bi_words

# Remove 1-count unigrams from the tokenized text column
data2['tokens_filtered'] = data2['tokens'].apply(
    lambda x: [word for word in x if word in filtered_words])

data2['tokens_filtered'].head()

data3 = data2.drop(columns='tokens').rename(
    columns = {'tokens_filtered': 'bigram_unigram_tokens'})

'''

In [20]:
# Uncomment if you ran the code above and need to
# re-export to csv. (not recommended)

#data3.to_csv('data/moviedata_tokens.csv', index=False)
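
One thing to keep in mind when reading the exported file back: `to_csv` writes each token list as its string representation, so the column comes back as plain strings unless it is parsed. The sketch below uses a toy frame and an in-memory buffer rather than the real file; passing `ast.literal_eval` as a converter is one possible way to restore the lists, not necessarily what the later notebooks do.

```python
import ast
import io

import pandas as pd

# Toy frame standing in for data3 (values are illustrative only)
toy = pd.DataFrame({
    'line_num': [0, 1],
    'tokens': [['or', 'can', 'or_can'], ['a', 'suit', 'a_suit']],
})

# Write to an in-memory buffer instead of data/moviedata_tokens.csv
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Without a converter, each cell is a string such as "['or', 'can', 'or_can']"
raw = pd.read_csv(io.StringIO(csv_text))

# ast.literal_eval safely parses that string back into a Python list
restored = pd.read_csv(io.StringIO(csv_text),
                       converters={'tokens': ast.literal_eval})

print(type(raw['tokens'][0]).__name__)       # str
print(type(restored['tokens'][0]).__name__)  # list
```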

## Next Steps
Now that we have our tokenized data, we can move on to vectorizing it and fitting our model.