# Module 11: Text Processing

**Topics**:
- Text Processing
- Tokenizing strings
- Use of the NLTK toolkit
- Loading and using stopwords
- Stemming words
- Lemmatizing words

**Notes**:
- Detecting spaces, newlines, and tabs is a common task in text processing.

In [1]:
import nltk
import string

In [2]:
# Useful string constants 
print(string.ascii_letters)
print(string.ascii_uppercase)
print(string.ascii_lowercase)
print(string.digits)
print(string.punctuation)
print(string.whitespace)

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
 	



In [3]:
# The first character is a space, followed by tab, newline, carriage return, vertical tab, and form feed
string.whitespace

' \t\n\r\x0b\x0c'
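
As a small illustration of how `string.whitespace` can be used, the sketch below counts each kind of whitespace character in a string. The sample text and the `counts` variable are made up for this example.

```python
import string
from collections import Counter

# Hypothetical sample text containing spaces, a tab, and newlines
text = "Hello there!\tIt is nice\nto see you again.\n"

# Count only the characters that appear in string.whitespace
counts = Counter(ch for ch in text if ch in string.whitespace)

print(counts[' '])   # number of spaces
print(counts['\t'])  # number of tabs
print(counts['\n'])  # number of newlines
```

A `Counter` returns 0 for characters it never saw, so looking up a whitespace character that does not occur in the text is safe.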

### Demo of useful string methods

In [4]:
s = 'Test'

print(s.lower())

print(s.upper())

test
TEST


In [5]:
s= 'test me.'

print(s.capitalize())

Test me.


In [6]:
s ='test me again'

print(s.title())

Test Me Again


In [7]:
s ='test'

print(s)

test


In [8]:
s.isalpha()

True

In [9]:
t= '321'

t.isalpha()

False

In [10]:
t='?'

t.isalnum()

False

In [11]:
s = 'test123'

s.isdigit()

False

In [12]:
s = 'test123'

s.islower()

True

In [13]:
s = 'Test'

s.isupper()

False

In [14]:
s ='T'

s.isspace()

False

In [15]:
s='\n'

s.isspace()

True
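
The `is...()` checks above can be combined to label a token by the kind of characters it contains. The following is only a sketch: the helper name `describe` and its labels are invented for this example, and the order of the checks matters because `'321'.isalnum()` is also `True`.

```python
def describe(token):
    """Return a rough label for a token based on its characters."""
    if token.isspace():
        return 'whitespace'
    if token.isdigit():
        return 'number'
    if token.isalpha():
        return 'word'
    if token.isalnum():
        return 'alphanumeric'
    return 'other'

# Try the helper on the same kinds of strings used in the cells above
for token in ['Test', '321', 'test123', '?', '\n']:
    print(repr(token), describe(token))
```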

### Tokenize string

In [16]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [17]:
para = "Hello there! It is nice to see you again. Do you have any questions?"

In [18]:
# Tokenize the paragraph into sentences

sentences = sent_tokenize(para)

len(sentences)

3

In [19]:
print(sentences[0])

print(sentences)

Hello there!
['Hello there!', 'It is nice to see you again.', 'Do you have any questions?']
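
To get a feel for why a dedicated sentence tokenizer is useful, the sketch below splits text with a plain regular expression instead. This is a deliberately naive stand-in, not how `sent_tokenize` works internally; the pattern and the `naive_split` name are assumptions made for this example.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text)

para = "Hello there! It is nice to see you again. Do you have any questions?"
print(naive_split(para))

# The naive rule also splits after abbreviations, which is usually not wanted
print(naive_split("Dr. Smith arrived. He sat down."))
```

The first call gives the same three sentences as above, while the second breaks `Dr.` off as its own "sentence", which is the kind of case a simple pattern cannot handle.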


In [20]:
# Tokenize the second sentence (index 1)
print(sentences[1])

print(word_tokenize(sentences[1]))

It is nice to see you again.
['It', 'is', 'nice', 'to', 'see', 'you', 'again', '.']


In [21]:
words = word_tokenize(sentences[1])

words

['It', 'is', 'nice', 'to', 'see', 'you', 'again', '.']

In [22]:
for w in words:
    print(w.isalpha())

True
True
True
True
True
True
True
False


### Remove punctuation marks

In [23]:
sentences[1].translate(str.maketrans('','',string.punctuation))

'It is nice to see you again'

In [24]:
new_sentence = sentences[1].translate(str.maketrans('','',string.punctuation))
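
A short sketch of what the `translate` call is doing: `str.maketrans('', '', chars)` builds a table that deletes every character in `chars`. The second sample string is made up here to show a side effect worth knowing about, namely that apostrophes and hyphens inside words are removed too.

```python
import string

# Build a translation table that deletes all ASCII punctuation characters
remove_punct = str.maketrans('', '', string.punctuation)

plain = "It is nice to see you again.".translate(remove_punct)
tricky = "Don't over-think it!".translate(remove_punct)

print(plain)
print(tricky)
```

Whether turning `Don't` into `Dont` is acceptable depends on the task, so it may be worth inspecting a few results before applying this to a whole corpus.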

In [25]:
# tokenize the sentence

words = word_tokenize(new_sentence)

In [26]:
# Check that every token is now alphabetic
for w in words:
    print(w.isalpha())

True
True
True
True
True
True
True


In [27]:
# Save the tokenized words to a list

words_saved =[]

for w in words:
    if w.isalpha():
        words_saved.append(w.lower())
        
print(words_saved)

['it', 'is', 'nice', 'to', 'see', 'you', 'again']
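
The same filtering and lowercasing can be written as a list comprehension. This is just an alternative sketch; the token list is copied in by hand so the example runs on its own.

```python
# Tokens as produced by word_tokenize earlier (copied here for a standalone example)
words = ['It', 'is', 'nice', 'to', 'see', 'you', 'again', '.']

# Keep alphabetic tokens only and lowercase them in one pass
words_saved = [w.lower() for w in words if w.isalpha()]

print(words_saved)
```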


### Loading the stop words.

In [28]:
from nltk.corpus import stopwords

In [29]:
english_stop_words = stopwords.words('english')

len(english_stop_words)

179

In [30]:
# Show the first 20 stopwords
english_stop_words[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

In [31]:
# Show which of the tokenized words are stop words

for w in words_saved:
   
    if w in english_stop_words :
        print(w)


it
is
to
you
again


In [32]:
# Make a list of words that are not found in stopwords

words_saved_2 = []

for w in words_saved:
   
    if w not in english_stop_words :
        words_saved_2.append(w)

print(words_saved_2)

['nice', 'see']
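
For longer texts, membership tests against a list get slow because each lookup scans the list. A common approach is to convert the stop words to a set first. The sketch below uses a tiny hand-picked stop list (an assumption, so that it runs without the NLTK data); with NLTK available, `set(stopwords.words('english'))` could be used in its place.

```python
# A tiny hand-picked stop list, used only so this example is self-contained
stop_words = {'it', 'is', 'to', 'you', 'again'}

words_saved = ['it', 'is', 'nice', 'to', 'see', 'you', 'again']

# Set membership is a constant-time lookup on average
content_words = [w for w in words_saved if w not in stop_words]

print(content_words)
```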


### Stemming Words Example

In [33]:
tokens = ['kick','kicks','kicked','kicking']

In [34]:
# Load the NLTK Porter Stemmer
porter = nltk.PorterStemmer()

In [35]:
for w in tokens:
    print(porter.stem(w))

kick
kick
kick
kick


### Lemmatizing Words Example

In [36]:
wnl = nltk.WordNetLemmatizer()

In [37]:
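# Lemmatize the list. lemmatize() treats each word as a noun by default (pos='n'),
# so verb forms such as 'kicked' and 'kicking' are left unchanged.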

for w in tokens:
    print(wnl.lemmatize(w))

kick
kick
kicked
kicking


In [38]:
s = 'It is nice to see you again.'

In [39]:
bsu_text = []

# Read the file line by line; the with block closes it automatically
with open('bsu_text_1.txt') as in_file:
    for line in in_file:
        bsu_text.append(line)

len(bsu_text)

2

In [40]:
print(bsu_text)

['Ball State University has partnered with Coursera, one of the largest online learning platforms in the world, to bring you the online master of science in data science. \n', 'The curriculum is designed by Ball State University, courses are taught by our accomplished faculty, and content is delivered through the Coursera platform.']
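
Note that the lines read from the file keep their trailing space and `\n`. A minimal sketch of stripping them while reading is shown below. It writes a small temporary file with made-up contents so that it runs on its own; the temporary path and sample text are assumptions, not the course file.

```python
import os
import tempfile

# Write a small hypothetical two-line file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('First line of text. \nSecond line of text.')
    path = tmp.name

# strip() removes leading and trailing whitespace, including the newline
with open(path, encoding='utf-8') as in_file:
    lines = [line.strip() for line in in_file]

os.remove(path)  # clean up the temporary file
print(lines)
```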
