## Preparing to work with unstructured data

Import the NLTK library by adding the following command to a cell in your Jupyter Notebook and running the cell:

In [1]:
import nltk

Next, we download the specific corpus we want to use, in this case the Brown Corpus. Alternatively, you can download every available package by passing 'all' to nltk.download(). If you are behind a firewall, an nltk.set_proxy() option is available. See the documentation at http://www.nltk.org/ for more details:

In [2]:
nltk.download('brown')

[nltk_data] Downloading package brown to /home/nbuser/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

To confirm the package is available in your Jupyter Notebook, import the corpus reader under its usual name, brown, so it can be referenced in later cells:

In [3]:
from nltk.corpus import brown

To display the first few words available in the Brown Corpus, use the following command:

In [4]:
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

To count all the words available, we can use the len() function, which returns the number of characters in a string or the number of items in a sequence such as a list. Because brown.words() returns a sequence of word tokens, len() returns the total number of tokens in the corpus. To make the result easier to format, let’s assign it to a variable called count_of_words, which we can use in the next step:

In [6]:
count_of_words = len(brown.words())

To make the output easier for the consumer of this data to read, we use the print() and format() functions to display the result with thousands separators:

In [7]:
print('Count of all the words found in the Brown Corpus =', format(count_of_words, ',d'))

Count of all the words found in the Brown Corpus = 1,161,192
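
As a quick check on how len() and format() behave, the following sketch uses a short hand-written list; sample_words is made up for illustration and stands in for the corpus. It shows that len() counts items in a list and characters in a string, and that the ',d' format specification inserts thousands separators into an integer:

```python
# Hypothetical sample data, standing in for brown.words()
sample_words = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said']

# len() returns the number of items in a list ...
print(len(sample_words))      # 6
# ... and the number of characters in a string
print(len('Jury'))            # 4

# ',d' formats an integer with comma thousands separators
count_of_words = 1161192
formatted = format(count_of_words, ',d')
print(formatted)              # 1,161,192

# An f-string with the same format specification produces the same text
print(f'{count_of_words:,d}' == formatted)   # True
```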


## What is tokenization and why it is important

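Download the Punkt tokenizer models, which the NLTK tokenizers rely on, by adding the following command to your Jupyter Notebook and running the cell. Newer NLTK releases may ask for the punkt_tab resource instead; if so, download the resource named in the error message. Feel free to follow along by creating your own Notebook; I have placed a copy in GitHub for reference: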

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/nbuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Next, we will create a new variable called input_sentence and assign it a free-form text sentence, which must be enclosed in quotes. There will be no output after you run the cell:

In [10]:
input_sentence = "Seth and Becca love to run down to the playground when the weather is nice."

Next, we will use the word_tokenize() function from the NLTK library to split the sentence into individual words and punctuation marks:

In [11]:
nltk.word_tokenize(input_sentence)

['Seth',
 'and',
 'Becca',
 'love',
 'to',
 'run',
 'down',
 'to',
 'the',
 'playground',
 'when',
 'the',
 'weather',
 'is',
 'nice',
 '.']
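
To get a feel for what word tokenization involves, here is a minimal sketch that uses only Python's re module. The simple_word_tokenize function is a hypothetical helper and a rough approximation, not the algorithm NLTK uses. It reproduces the result above for this sentence but falls short on cases such as contractions:

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Seth and Becca love to run down to the playground when the weather is nice."
tokens = simple_word_tokenize(sentence)
print(tokens)

# Contractions show where the simplification breaks down:
# this pattern splits "don't" into three pieces.
print(simple_word_tokenize("don't"))   # ['don', "'", 't']
```

A purpose-built tokenizer such as word_tokenize() applies additional rules for cases like these, which is one reason to prefer it over a hand-written pattern.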

Next, let’s tokenize by sentence, which requires importing the sent_tokenize function from the nltk.tokenize module using the following command:

In [12]:
from nltk.tokenize import sent_tokenize

Now let’s assign a collection of sentences to a new variable called input_data, which we can use later in our code. There will be no output after you run the cell:

In [13]:
input_data = "Seth and Becca love the playground.  When it sunny, they head down there to play."

Then we pass the variable input_data as a parameter to the sent_tokenize() function, which looks at the string of text and breaks it into individual sentences. We wrap the call in the print() function to display the results more cleanly in the Notebook:

In [15]:
print(sent_tokenize(input_data))

['Seth and Becca love the playground.', 'When it sunny, they head down there to play.']
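
Sentence splitting looks simple, but a naive rule breaks quickly. The sketch below is a hypothetical helper named naive_sent_split, not NLTK's method. It splits after '.', '!' or '?' when whitespace follows, which works for the example text but wrongly splits after an abbreviation:

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Seth and Becca love the playground.  When it sunny, they head down there to play."
sentences = naive_sent_split(text)
print(sentences)

# The naive rule also splits after abbreviations
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The Punkt models downloaded earlier are trained to recognize abbreviations and similar cases, so sent_tokenize() is generally the safer choice for real text.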


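The same approach works for word tokens. Here we import word_tokenize directly, store its result in a variable, and print it on a single line:

In [14]:
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize(input_sentence)
print(tokenized_word)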

['Seth', 'and', 'Becca', 'love', 'to', 'run', 'down', 'to', 'the', 'playground', 'when', 'the', 'weather', 'is', 'nice', '.']


## Counting words and exploring results

Import the FreqDist class from the probability module of the NLTK library to count how often each word occurs in a body of text. There will be no result returned after you run the cell:

In [18]:
from nltk.probability import FreqDist

Next, explore a large body of text using the Brown Corpus. To do this, we pass all of the corpus word tokens to FreqDist() and assign the result to a variable named input_data. Printing the variable shows a summary with the number of distinct tokens (samples) and the total number of tokens (outcomes):

In [19]:
input_data = FreqDist(brown.words())
print(input_data)

<FreqDist with 56057 samples and 1161192 outcomes>


To see the most common words in input_data, we can use the most_common() method with a parameter that controls how many are displayed. In this case, we want to see the top 10. Note that punctuation marks are counted as tokens too:

In [20]:
input_data.most_common(10)

[('the', 62713),
 (',', 58334),
 ('.', 49346),
 ('of', 36080),
 ('and', 27915),
 ('to', 25732),
 ('a', 21881),
 ('in', 19536),
 ('that', 10237),
 ('is', 10011)]
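
FreqDist builds on Python's collections.Counter, so the same idea can be sketched with the standard library alone. The token list below is made up for illustration. The example shows that counting is case-sensitive, and how lowercasing and dropping punctuation before counting changes the result:

```python
from collections import Counter

# Hypothetical tokens standing in for a tokenized corpus
tokens = ['The', 'dog', 'and', 'the', 'cat', 'and', 'the', 'bird', '.']

# Counting the raw tokens treats 'The' and 'the' as different words
raw_counts = Counter(tokens)
print(raw_counts['the'])                 # 2

# Lowercase and keep alphabetic tokens only, then count again
words = [t.lower() for t in tokens if t.isalpha()]
word_counts = Counter(words)
print(word_counts.most_common(2))        # [('the', 3), ('and', 2)]

# Relative frequency of a word: its count divided by the total word count
print(word_counts['the'] / len(words))   # 0.375
```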

## Text normalization techniques

Import the PorterStemmer class from the NLTK library to normalize a word by reducing it to its stem. There will be no result returned after you run the cell:

In [21]:
from nltk.stem import PorterStemmer

To create an instance of the stemmer so it can be referenced later in the code, we use the following code. There will be no result returned after you run the cell:

In [22]:
my_word_stemmer = PorterStemmer()

Now you can pass individual words into the instance to see how the word would be normalized:

In [23]:
my_word_stemmer.stem('fishing')

'fish'

To use the lemmatization feature, we need to download the WordNet corpus using the following command:

In [24]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/nbuser/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

Next, import the WordNetLemmatizer class and create an instance of it so it can be referenced later in the code. There will be no result returned after you run the cell:

In [26]:
from nltk.stem import WordNetLemmatizer
my_word_lemmatizer = WordNetLemmatizer()

To see how the lemma compares with the stem for the same word we used earlier, we pass that word to the lemmatize() method. Unless told otherwise, lemmatize() treats its input as a noun, so the word comes back unchanged:

In [27]:
my_word_lemmatizer.lemmatize('fishing')

'fishing'

To build a small sample, we take the first 10 words of the Brown Corpus and assign them to a variable using the following command. There will be no result returned after you run the cell:

In [28]:
my_list_of_words = brown.words()[:10]

Now we loop over each word in the list and print its stem and lemma. We include some formatting to make the results for each row easier to read:

In [29]:
for x in my_list_of_words:
    print('word =', x, ': stem =', my_word_stemmer.stem(x), ': lemma =', my_word_lemmatizer.lemmatize(x))

word = The : stem = The : lemma = The
word = Fulton : stem = Fulton : lemma = Fulton
word = County : stem = Counti : lemma = County
word = Grand : stem = Grand : lemma = Grand
word = Jury : stem = Juri : lemma = Jury
word = said : stem = said : lemma = said
word = Friday : stem = Friday : lemma = Friday
word = an : stem = an : lemma = an
word = investigation : stem = investig : lemma = investigation
word = of : stem = of : lemma = of
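
The results above show the core difference: a stemmer applies suffix rules and may produce non-words such as 'Counti', while a lemmatizer looks for a dictionary form and otherwise leaves the word alone. The sketch below illustrates that contrast with two deliberately tiny, hypothetical helpers. toy_stem is far cruder than the Porter rules, and TOY_LEMMAS is a made-up two-entry dictionary standing in for WordNet:

```python
def toy_stem(word):
    """Strip one of a few common suffixes; far cruder than the Porter rules."""
    for suffix in ('ing', 'ed', 's'):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical mini dictionary standing in for WordNet
TOY_LEMMAS = {'mice': 'mouse', 'geese': 'goose'}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ['fishing', 'jumped', 'mice']:
    print('word =', word, ': stem =', toy_stem(word), ': lemma =', toy_lemmatize(word))
```

Rule-based stripping handles regular suffixes but cannot map 'mice' to 'mouse', while the dictionary lookup handles irregular forms but only for words it knows.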


## Excluding words from analysis

Download the stopwords corpus from the NLTK library using the following command:

In [30]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/nbuser/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Next, import the stopwords corpus and the word_tokenize function so they can be used later in the exercise:

In [31]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

Now let’s assign a collection of sentences to the variable input_data, which we can use later in our code. There will be no output after you run the cell:

In [32]:
input_data = "Seth and Becca love the playground.  When it sunny, they head down there to play."

We will create two variables so they can be referenced later in the code: stop_words, a set of English stop words, and word_tokens, the tokenized input_data:

In [33]:
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(input_data)

Finally, we use a list comprehension that loops through the word tokens from input_data and compares each one to stop_words. Tokens that match are excluded. The cell then prints the original tokenized input_data along with the results after the stop words have been removed:

In [35]:
# Keep only the tokens that do not appear in the stop-word set
input_data_cleaned = [x for x in word_tokens if x not in stop_words]

print(word_tokens)
print(input_data_cleaned)

['Seth', 'and', 'Becca', 'love', 'the', 'playground', '.', 'When', 'it', 'sunny', ',', 'they', 'head', 'down', 'there', 'to', 'play', '.']
['Seth', 'Becca', 'love', 'playground', '.', 'When', 'sunny', ',', 'head', 'play', '.']
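
Notice that 'When' survived the filter above: the stop-word list is lowercase and the membership test is case-sensitive. The following sketch shows one possible refinement, assuming a small made-up stop-word set in place of the NLTK list. It compares the lowercased token and then optionally drops punctuation:

```python
# Hypothetical mini stop-word set; NLTK's English list is much longer
stop_words = {'and', 'the', 'when', 'it', 'they', 'down', 'there', 'to'}

word_tokens = ['Seth', 'and', 'Becca', 'love', 'the', 'playground', '.',
               'When', 'it', 'sunny', ',', 'they', 'head', 'down', 'there',
               'to', 'play', '.']

# Compare the lowercased token so capitalized stop words are caught too
case_insensitive = [x for x in word_tokens if x.lower() not in stop_words]
print(case_insensitive)

# Additionally drop punctuation by keeping alphabetic tokens only
words_only = [x for x in case_insensitive if x.isalpha()]
print(words_only)
```

Whether to lowercase or remove punctuation depends on the analysis. Both steps discard information that some tasks need.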
