Author: Jakidxav

Date: 04.18.2019

This notebook tests and demonstrates the helper functions I wrote as NLTK wrappers for the DAIC_WOZ NLP project. The functions are meant to clarify the processing workflow and declutter the workspace. Every function in `clean_text.py` has the form `output = function(input)`.
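
Because every helper takes one input and returns one output, the steps can be chained mechanically. The block below is a minimal sketch of that idea, not code from `clean_text.py`: `whitespace_tokenize`, `lower_tokens`, `drop_digits` and `run_pipeline` are hypothetical stand-ins written with the standard library only, so the example runs without NLTK.

```python
from functools import reduce

# Hypothetical stand-ins for illustration; the real helpers live in clean_text.py.
def whitespace_tokenize(text):
    """Split a string on runs of whitespace."""
    return text.split()

def lower_tokens(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def drop_digits(tokens):
    """Drop tokens made up only of digits."""
    return [t for t in tokens if not t.isdigit()]

def run_pipeline(data, steps):
    """Feed data through each single-argument step, in order."""
    return reduce(lambda value, step: step(value), steps, data)

steps = [whitespace_tokenize, lower_tokens, drop_digits]
result = run_pipeline("This is  an ExAmPlE 45 string", steps)
print(result)  # ['this', 'is', 'an', 'example', 'string']
```

Keeping each step as a plain one-argument function makes it easy to reorder steps or inspect any intermediate result, which is what the cells below do by hand.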

In [1]:
import string
import nltk

from clean_text import *

In [2]:
#example string
example = "<semantics> me This mmm los angeles  &is [an] ExAmPlE? {of} 45 Jennifer string Americans. My with.? <sigh> xxx [] you they theirs punctuation mine michelle, [laugh] by stopwordS wouldn't I'm and whitespace.    Can't wait!!!!"

#create stop words set
stops = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation

### Example processing flow:
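
Judging only from the `Remove Semantic Info` output printed later in this notebook, `remove_semantics` appears to drop tokens that are fully wrapped in angle or square brackets (`<sigh>`, `[laugh]`, `[]`) while keeping curly-braced ones such as `{of}`. The sketch below reproduces that observed behavior under this assumption; `drop_bracketed_tokens` is a hypothetical name and may differ from the actual implementation in `clean_text.py`.

```python
import re

# Assumption: a "semantic" token is one fully enclosed in <...> or [...].
BRACKETED = re.compile(r"^(<[^>]*>|\[[^\]]*\])$")

def drop_bracketed_tokens(tokens):
    """Return the tokens that are not wrapped in angle or square brackets."""
    return [t for t in tokens if not BRACKETED.match(t)]

sample = ['<semantics>', 'me', '&is', '[an]', '{of}', '[]', 'xxx']
kept = drop_bracketed_tokens(sample)
print(kept)  # ['me', '&is', '{of}', 'xxx']
```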

In [3]:
#sentence tokenize
sentences = sentence_tokenize(example)

#whitespace tokenize
tokens = tokenize(example)

#take out semantic information
no_semantics = remove_semantics(tokens)

#remove proper nouns: NNP and NNPS are the POS tags for proper nouns
no_nnp = remove_proper_nouns(no_semantics)

#convert back to string to split apart contractions
l2s = list_to_string(no_nnp)

#split contractions AFTER removing semantic information
#(store the result under a new name so the imported function is not overwritten)
contractions_split = split_contractions(l2s)

#strip punctuation
no_punct = strip_punctuation(contractions_split)

#convert to lowercase
lower = lower_case(no_punct)

#remove stopwords
no_stops = remove_stopwords(lower)

#remove numbers from tokens
words = remove_numbers(no_stops)

In [4]:
print('Raw Input:\n', example, '\n')
print('Sentence tokenize:\n', sentences, '\n')
print('Tokenize:\n', tokens, '\n')
print('Remove Semantic Info:\n', no_semantics, '\n')
print('Remove Proper Nouns:\n', no_nnp, '\n')
print('Split Contractions:\n', contractions_split, '\n')
print('Remove Punctuation:\n', no_punct, '\n')
print('Lowercase:\n', lower, '\n')
print('Remove Stopwords:\n', no_stops, '\n')
print('Final:\n', words, '\n')

Raw Input:
 <semantics> me This mmm los angeles  &is [an] ExAmPlE? {of} 45 Jennifer string Americans. My with.? <sigh> xxx [] you they theirs punctuation mine michelle, [laugh] by stopwordS wouldn't I'm and whitespace.    Can't wait!!!! 

Sentence tokenize:
 ['<semantics> me This mmm los angeles  &is [an] ExAmPlE?', '{of} 45 Jennifer string Americans.', 'My with.?', "<sigh> xxx [] you they theirs punctuation mine michelle, [laugh] by stopwordS wouldn't I'm and whitespace.", "Can't wait!!!", '!'] 

Tokenize:
 ['<semantics>', 'me', 'This', 'mmm', 'los', 'angeles', '&is', '[an]', 'ExAmPlE?', '{of}', '45', 'Jennifer', 'string', 'Americans.', 'My', 'with.?', '<sigh>', 'xxx', '[]', 'you', 'they', 'theirs', 'punctuation', 'mine', 'michelle,', '[laugh]', 'by', 'stopwordS', "wouldn't", "I'm", 'and', 'whitespace.', "Can't", 'wait!!!!'] 

Remove Semantic Info:
 ['me', 'This', 'mmm', 'los', 'angeles', '&is', 'ExAmPlE?', '{of}', '45', 'Jennifer', 'string', 'Americans.', 'My', 'with.?', 'xxx', 'you'