# Text Analysis

In this module, we will use the Natural Language Toolkit (NLTK) to look at individual words and sentences in a text and to clean unnecessary features from the text data in preparation for sentiment analysis. Then, using the TextBlob library, we will analyze the sentiment of opinionated text and produce a numerical value for use in a predictive model.

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#this is sample data
from nltk.corpus import names  

from string import punctuation

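#download the NLTK data packages used in this notebook
#(packages that are already installed are reported as up-to-date and skipped)
#the stopword list used later also needs nltk.download('stopwords') if it is not installed yet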
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('names')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lauracutrer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lauracutrer/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     /Users/lauracutrer/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [3]:
with open('datasets_12dancingprincesses.txt', encoding='cp1252') as f:
    for line in f:
        print(line)

THE TWELVE DANCING PRINCESSES



There was a king who had twelve beautiful daughters. They slept in

twelve beds all in one room; and when they went to bed, the doors were

shut and locked up; but every morning their shoes were found to be quite

worn through as if they had been danced in all night; and yet nobody

could find out how it happened, or where they had been.



Then the king made it known to all the land, that if any person could

discover the secret, and find out where it was that the princesses

danced in the night, he should have the one he liked best for his

wife, and should be king after his death; but whoever tried and did not

succeed, after three days and nights, should be put to death.



A king’s son soon came. He was well entertained, and in the evening was

taken to the chamber next to the one where the princesses lay in their

twelve beds. There he was to sit and watch where they went to dance;

and, in order that nothing might pass without his hearing it, the
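The blank line after every line of the printed story appears because each line read from the file still ends with its own newline character, and `print` then adds another. A minimal sketch of one way to avoid this, assuming a short in-memory stand-in (`sample`, a hypothetical name) instead of the real story file:

```python
import io

# Hypothetical two-line stand-in for the story file, so the sketch runs anywhere.
sample = io.StringIO(
    "There was a king who had twelve beautiful daughters. They slept in\n"
    "twelve beds all in one room;\n"
)

printed = []
for line in sample:
    # Each line read from a file still ends with '\n'; strip it before
    # printing so print() does not add a second line break.
    clean = line.rstrip("\n")
    printed.append(clean)
    print(clean)
```

Calling `print(line, end='')` instead would give the same single-spaced output without modifying the line.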

In [22]:
#create an empty list here to hold the tokens at the end
tokensPrincess = []


with open('datasets_12dancingprincesses.txt', encoding='cp1252') as f:
    for line in f:
        cline = line.strip() #get rid of the newline character and surrounding whitespace

        if cline == '':
            continue  #skip lines that only had a newline and are now blank

        tknls = word_tokenize(cline)

        for token in tknls:
            #append each token to the list created at the start of this cell
            tokensPrincess.append(token)

In [23]:
len(tokensPrincess)

1803

In [24]:
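#note: this makes a second name for the same list, not an independent copy,
#so later changes to tokensPrincess also show up in tokensPrincess_copy
tokensPrincess_copy = tokensPrincess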

In [28]:
len(tokensPrincess_copy)

1617

#### Tokenizing Words and Sentences

Recall that in the "Python Dictionaries and String Manipulation" notebook, we used the .split() method to break a sentence apart.
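To see why a tokenizer is used here instead of `.split()`, the sketch below compares the two on one sentence from the story. The regular expression is only a rough stand-in so the example runs without NLTK data; it is not the algorithm behind `word_tokenize`, although for simple sentences like this one both separate punctuation into its own tokens.

```python
import re

sentence = "They slept in twelve beds all in one room; and yet nobody could find out."

# str.split() only breaks on whitespace, so punctuation stays glued to words.
split_tokens = sentence.split()

# Rough regex stand-in for a word tokenizer (NOT NLTK's actual algorithm):
# runs of word characters, or single non-space punctuation characters.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)  # contains 'room;' and 'out.'
print(regex_tokens)  # contains 'room', ';', 'out', '.'
```

Separate punctuation tokens are what make the later punctuation-removal step possible.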

In [26]:
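#remove the punctuation tokens from the list
#note: removing items while looping over the same list can skip the token
#right after each removal, so some punctuation tokens may remain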
for word in tokensPrincess:
    if word in punctuation:
        tokensPrincess.remove(word)
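Removing items from a list while looping over it shifts the remaining items left, so the token right after each removed one is never checked. This is a likely reason a `';'` token still appears in the frequency counts later in this notebook. The sketch below reproduces the effect on a toy list (`tokens`, `in_place`, and `filtered` are hypothetical names) and shows one alternative that builds a new list instead:

```python
from string import punctuation

# Toy token list with two punctuation tokens side by side.
tokens = ["room", ";", ",", "and", "."]

# Pattern from the cell above: mutate the list while looping over it.
in_place = list(tokens)
for word in in_place:
    if word in punctuation:
        in_place.remove(word)

# Alternative: build a new list, testing membership against a set of
# single punctuation characters instead of the punctuation string.
punct_chars = set(punctuation)
filtered = [word for word in tokens if word not in punct_chars]

print(in_place)  # ['room', ',', 'and']  (the ',' was skipped)
print(filtered)  # ['room', 'and']
```

Testing against a set also avoids substring matches, since `word in punctuation` on a string is true for any token that happens to be a substring of it. Neither version removes multi-character tokens such as `'--'`.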

In [29]:
print(len(tokensPrincess))
print(len(tokensPrincess_copy))

1617
1617
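Both lengths match because plain assignment does not copy a list; it only gives the same list a second name, so removing tokens through one name is visible through the other. A minimal sketch of the difference between an alias and an independent copy, using a toy list with hypothetical names:

```python
original = ["king", ";", "princesses"]

alias = original            # second name for the same list object
snapshot = original.copy()  # independent shallow copy

original.remove(";")

print(alias is original, len(alias))        # True 2
print(snapshot is original, len(snapshot))  # False 3
```

To keep the unfiltered tokens for comparison, the copy would need to be made with `.copy()` (or `list(...)`) before the removal step runs.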


In [30]:
#list of English stopwords
eng_stopwords = stopwords.words('english')
eng_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [31]:
#rm_count = 0
new_tokensPrincess = []  #list to hold new words

for word in tokensPrincess:
    if word not in eng_stopwords:
        new_tokensPrincess.append(word)
    #else: rm_count += 1
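The stopword list is all lowercase, so the membership test above is case-sensitive: capitalised tokens such as `'I'`, `'And'`, or `'Then'` are kept, which is why they show up in the frequency counts that follow. A minimal sketch of a case-insensitive filter, assuming a small hypothetical stopword set in place of `stopwords.words('english')`:

```python
# Small hypothetical stopword set standing in for stopwords.words('english').
sample_stopwords = {"the", "and", "then", "i", "to"}

tokens = ["Then", "the", "soldier", "said", "I", "went", "And", "danced"]

# Case-sensitive test, as in the cell above: capitalised stopwords slip through.
case_sensitive = [t for t in tokens if t not in sample_stopwords]

# Lowercase each token only for the comparison; the kept tokens are unchanged.
case_insensitive = [t for t in tokens if t.lower() not in sample_stopwords]

print(case_sensitive)    # ['Then', 'soldier', 'said', 'I', 'went', 'And', 'danced']
print(case_insensitive)  # ['soldier', 'said', 'went', 'danced']
```

Converting the stopword list to a `set` also makes each lookup faster than searching a list, which matters on longer texts.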

In [32]:
len(new_tokensPrincess)

754

In [33]:
#the NLTK FreqDist gives a count for how often each part of the text occurs
fd_wct = FreqDist(new_tokensPrincess)
fd_wct

FreqDist({'--': 2,
          ';': 1,
          'A': 1,
          'After': 1,
          'And': 6,
          'As': 2,
          'At': 1,
          'But': 3,
          'DANCING': 1,
          'He': 2,
          'However': 1,
          'I': 11,
          'In': 1,
          'Just': 1,
          'Now': 1,
          'On': 1,
          'One': 1,
          'PRINCESSES': 1,
          'So': 1,
          'THE': 1,
          'TWELVE': 1,
          'That': 1,
          'The': 4,
          'Then': 8,
          'There': 3,
          'They': 2,
          'When': 4,
          'able': 1,
          'adventure': 1,
          'afraid': 1,
          'afterwards': 1,
          'already': 1,
          'always': 2,
          'another': 2,
          'answered': 2,
          'approach.’': 1,
          'asked': 4,
          'asleep': 2,
          'asleep.’': 1,
          'away': 4,
          'awoke': 1,
          'back': 1,
          'battle': 1,
          'beautiful': 1,
          'beautifully': 1,
          'bec

In [34]:
fd_wct.most_common(10)

[('soldier', 19),
 ('said', 16),
 ('princesses', 16),
 ('king', 12),
 ('went', 11),
 ('I', 11),
 ('twelve', 10),
 ('came', 10),
 ('eldest', 10),
 ('Then', 8)]
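The counts above still treat `'asleep'` and `'asleep.’'` as different tokens, and `'--'` is counted as a word. One possible cleanup, sketched under the assumption that trimming punctuation from token edges and lowercasing is acceptable for the analysis, is shown below. `collections.Counter` is used as a stand-in for `FreqDist` so the sketch runs without NLTK, and the token list is a toy example modelled on the output above.

```python
from collections import Counter
from string import punctuation

# Toy tokens modelled on the output above.
tokens = ["asleep", "asleep.’", "Then", "then", "then", "soldier", "--"]

# Characters to trim from token edges: ASCII punctuation plus curly quotes.
edge_chars = punctuation + "‘’“”"

cleaned = []
for token in tokens:
    stripped = token.strip(edge_chars).lower()
    if stripped:  # drop tokens that were nothing but punctuation
        cleaned.append(stripped)

counts = Counter(cleaned)
print(counts.most_common(2))  # [('then', 3), ('asleep', 2)]
```

Lowercasing merges `'Then'` with `'then'`, which may or may not be wanted, for example when capitalisation distinguishes names from ordinary words.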