Author: Joshua Driscol


### Introduction

This notebook tests and demonstrates the helper functions I wrote as NLTK wrappers for the DAIC_WOZ NLP project. The functions are meant to clarify the processing workflow by keeping the workspace uncluttered. Every function in `clean_text.py` has the form `output = function(input)`.
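As a rough illustration of that `output = function(input)` shape, the sketch below shows how three of these helpers might look. These bodies are assumptions, not the actual contents of `clean_text.py`; they are written only to be consistent with the outputs printed further down in this notebook.

```python
import re
import string

# Hypothetical stand-ins; the real clean_text.py may be implemented differently.

def tokenize(text):
    """Split on runs of whitespace, so a contraction like "Can't" stays one token."""
    return text.split()

def remove_semantics(tokens):
    """Drop tokens fully wrapped in <...> or [...], e.g. '<sigh>' or '[laugh]'."""
    return [t for t in tokens if not re.fullmatch(r'<.*>|\[.*\]', t)]

def strip_punctuation(tokens):
    """Delete every character in string.punctuation and discard emptied tokens."""
    table = str.maketrans('', '', string.punctuation)
    stripped = [t.translate(table) for t in tokens]
    return [t for t in stripped if t]

tokens = tokenize("<sigh> This   &is [an] ExAmPlE? {of} Can't")
print(strip_punctuation(remove_semantics(tokens)))
# prints ['This', 'is', 'ExAmPlE', 'of', 'Cant']
```

Each function takes one value and returns a new one without modifying its input, which is what lets the steps be reordered or inspected individually.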

In [1]:
import string
import nltk

from clean_text import *

In [2]:
# example string mixing semantic tags, punctuation, a number, mixed case, and extra whitespace
example = "<semantics> This   &is [an] ExAmPlE? {of} 45 string. with.? <sigh> [] punctuation, [laugh] by stopwordS and whitespace.    Can't wait!!!!"

# create the stop word set (needs the NLTK stopwords corpus, e.g. via nltk.download('stopwords'))
stops = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation
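
The stop words are stored in a `set` so that each membership check is a constant-time lookup. The NLTK English stop word list is all lowercase, which is why the tokens are lowercased before stop words are removed. A minimal sketch of that filtering step, assuming a small hard-coded stand-in set (so the block runs without the NLTK corpus) and a hypothetical optional `stops` parameter that the real `remove_stopwords` may not have:

```python
# Small stand-in for nltk.corpus.stopwords.words('english'); the real list is much longer.
STAND_IN_STOPS = {'this', 'is', 'of', 'with', 'by', 'and'}

def remove_stopwords(tokens, stops=STAND_IN_STOPS):
    """Keep only tokens that are not in `stops`; expects lowercased tokens."""
    return [t for t in tokens if t not in stops]

print(remove_stopwords(['this', 'is', 'example', 'of', 'string']))
# prints ['example', 'string']

# Without lowercasing first, a capitalized stop word slips through:
print(remove_stopwords(['This', 'is', 'example']))
# prints ['This', 'example']
```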

### Example processing flow:

In [3]:
# tokenize on whitespace; contractions such as "Can't" stay as single tokens
tokens = tokenize(example)

# remove semantic annotations such as <sigh> and [laugh]
no_semantics = remove_semantics(tokens)

# strip punctuation
no_punct = strip_punctuation(no_semantics)

# convert to lowercase
lower = lower_case(no_punct)

# remove stop words
no_stops = remove_stopwords(lower)

# remove numeric tokens
words = remove_numbers(no_stops)
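
Since every step takes one value and returns one value, the flow above could also be written as a single chain. A minimal sketch, assuming a hypothetical `pipeline` helper (not part of `clean_text.py`) and simple stand-in steps so the block runs on its own:

```python
from functools import reduce

def pipeline(*steps):
    """Return a function that applies each step in order, left to right."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run

# Stand-in steps for illustration; in this notebook they would be the
# clean_text helpers (tokenize, remove_semantics, and so on).
def split_on_whitespace(text):
    return text.split()

def to_lower(tokens):
    return [t.lower() for t in tokens]

def drop_numeric(tokens):
    return [t for t in tokens if not t.isdigit()]

clean = pipeline(split_on_whitespace, to_lower, drop_numeric)
print(clean("ExAmPlE 45   String"))
# prints ['example', 'string']
```

With the real helpers this would read `pipeline(tokenize, remove_semantics, strip_punctuation, lower_case, remove_stopwords, remove_numbers)(example)`. Keeping the intermediate variables, as this notebook does, is more convenient when each stage needs to be printed.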

In [4]:
print('Raw Input:\n', example, '\n')
print('Tokenize:\n', tokens, '\n')
print('Remove Semantic Info:\n', no_semantics, '\n')
print('Remove Punctuation:\n', no_punct, '\n')
print('Lowercase:\n', lower, '\n')
print('Remove Stopwords:\n', no_stops, '\n')
print('Final:\n', words, '\n')

Raw Input:
 <semantics> This   &is [an] ExAmPlE? {of} 45 string. with.? <sigh> [] punctuation, [laugh] by stopwordS and whitespace.    Can't wait!!!! 

Tokenize:
 ['<semantics>', 'This', '&is', '[an]', 'ExAmPlE?', '{of}', '45', 'string.', 'with.?', '<sigh>', '[]', 'punctuation,', '[laugh]', 'by', 'stopwordS', 'and', 'whitespace.', "Can't", 'wait!!!!'] 

Remove Semantic Info:
 ['This', '&is', 'ExAmPlE?', '{of}', '45', 'string.', 'with.?', 'punctuation,', 'by', 'stopwordS', 'and', 'whitespace.', "Can't", 'wait!!!!'] 

Remove Punctuation:
 ['This', 'is', 'ExAmPlE', 'of', '45', 'string', 'with', 'punctuation', 'by', 'stopwordS', 'and', 'whitespace', 'Cant', 'wait'] 

Lowercase:
 ['this', 'is', 'example', 'of', '45', 'string', 'with', 'punctuation', 'by', 'stopwords', 'and', 'whitespace', 'cant', 'wait'] 

Remove Stopwords:
 ['example', '45', 'string', 'punctuation', 'stopwords', 'whitespace', 'cant', 'wait'] 

Final:
 ['example', 'string', 'punctuation', 'stopwords', 'whitespace', 'cant', 