# Text Analytics for Beginners using NLTK
(and a whole lot of other things...)

A heavily revised take on Datacamp's tutorial: https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk


## Overview

Import NLTK and the tokenizing functions it provides. `nltk` is the top-level package, and `sent_tokenize` from `nltk.tokenize` splits a text into sentences.

In [3]:
import nltk
from nltk.tokenize import sent_tokenize

Tokenize the text into sentences using the NLTK function `sent_tokenize`.

In [5]:
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]


In [6]:
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
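
The split of `shouldn't` into `should` and `n't` in the output above follows Penn Treebank tokenization conventions. A minimal sketch of that rule in isolation: NLTK's `TreebankWordTokenizer` applies the same style of contraction splitting and, being purely rule-based, needs no downloaded model (unlike `sent_tokenize`, which relies on the Punkt sentence model).

```python
from nltk.tokenize import TreebankWordTokenizer

# Rule-based tokenizer; no model download required.
tokenizer = TreebankWordTokenizer()

sentence = "You shouldn't eat cardboard"
tokens = tokenizer.tokenize(sentence)
print(tokens)
```

If a later analysis should treat `shouldn't` as a single unit, a different tokenizer or a post-processing step would be needed.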


In [7]:
## Frequency Distribution

In [8]:
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 25 samples and 30 outcomes>


In [9]:
fdist.most_common(2)

[('is', 3), (',', 2)]
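
In the summary `<FreqDist with 25 samples and 30 outcomes>`, "outcomes" is the total number of tokens counted and "samples" is the number of distinct tokens. A small sketch of how to read those numbers off the object, using the token list printed earlier (hard-coded here so the snippet runs on its own):

```python
from nltk.probability import FreqDist

# Token list copied from the word_tokenize output above.
tokens = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
          'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
          'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

fdist = FreqDist(tokens)

total_tokens = fdist.N()      # number of outcomes (all tokens counted)
distinct_tokens = fdist.B()   # number of samples (unique tokens)
count_is = fdist['is']        # raw count of one token
share_is = fdist.freq('is')   # relative frequency: count / total

print(total_tokens, distinct_tokens, count_is, share_is)
```

Punctuation marks count as tokens here, and a capitalized `The` would be counted separately from a lowercase `the`, which is one reason to normalize tokens before counting.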

In [10]:
#Frequency Distribution Plot
import matplotlib.pyplot as plt
fdist.plot(30,cumulative=False)
plt.show()

<Figure size 640x480 with 1 Axes>

# Stopwords
Stop words are the "little words": words used so frequently that they carry little semantic weight. Their presence can skew results and make interpretation challenging.

In [11]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{"she's", 'with', "mightn't", 'wouldn', 'your', 'can', "you're", 'should', 'mightn', 'ain', 'them', 'when', 'about', "isn't", 'don', 'to', 'other', 'again', 'there', 'here', 'after', 'up', 'own', 'then', 's', 'until', 'ourselves', 'yours', 'isn', 'has', 'i', 'off', 'than', 'more', "haven't", "doesn't", "couldn't", 'you', "you'll", 'what', 'been', 'few', 'her', 'yourself', 'shan', 'd', 'how', "that'll", 'themselves', 'for', 'were', "hasn't", 'all', 'shouldn', 'herself', 'nor', 'once', 'm', 'was', 'y', 'why', "you've", 'she', 'into', 't', 'now', 'during', 'he', 'at', "wouldn't", 'while', 'as', 'by', 'our', "it's", 'which', 'they', 'the', "should've", 'it', 'needn', 'down', 'each', 'did', 'mustn', 'out', 'aren', 'and', 'itself', 'myself', 'because', 'no', 'in', 'hers', 'hasn', 'wasn', 'against', 'just', 'both', 'of', 'is', "don't", "shan't", 'between', 'very', 'too', "hadn't", "wasn't", 'over', 'before', 'doesn', 'or', 'above', 'an', 'had', 'if', 'am', "mustn't", 've', 'under', "shouldn't

In [12]:
# Note: tokenized_sent holds whole sentences (the sent_tokenize output), so no
# element ever equals a single stop word and nothing is filtered out below.
# Filter tokenized_word instead to actually remove stop words.
tokenized_sent=tokenized_text
filtered_sent=[]
for w in tokenized_sent:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:",tokenized_sent)
print("Filtered Sentence:",filtered_sent)

Tokenized Sentence: ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
Filtered Sentence: ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
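
In the output above the filtered list is identical to the input because each element is a whole sentence, and a whole sentence never equals a stop word. A minimal sketch of word-level filtering follows. The token list is copied from the `word_tokenize` output, and `demo_stop_words` is a small hand-picked stand-in (an assumption for illustration) so the snippet runs without the stopwords corpus; in the notebook you would use the full `stop_words` set instead.

```python
# Token list copied from the word_tokenize output earlier in the notebook.
tokens = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
          'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
          'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

# Hypothetical mini stop list for illustration only;
# in practice use set(stopwords.words("english")).
demo_stop_words = {"how", "are", "you", "doing", "the", "is", "and", "should"}

# Compare in lowercase so capitalized forms such as "The" and "You" are removed too.
filtered_words = [w for w in tokens if w.lower() not in demo_stop_words]
print(filtered_words)
```

The NLTK English stop list is all lowercase, so without `.lower()` tokens such as `The` would survive the filter.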


# Lexicon Normalization 

What do we do about words that mean very similar things and share a root, but appear as different "strings" of characters in the text? Does it make sense to treat each instance individually, or should we combine them? Lexicon normalization is the step in preparing texts for computational analysis that combines words with the same or similar meanings that differ only in, for example, tense, number, or possession.

In [13]:
## Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps=PorterStemmer()

stemmed_words=[]
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))

print("Filtered Sentence:",filtered_sent)
print("Stemmed Sentence:",stemmed_words)

Filtered Sentence: ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
Stemmed Sentence: ['hello mr. smith, how are you doing today?', 'the weather is great, and city is awesome.', 'the sky is pinkish-blue.', "you shouldn't eat cardboard"]
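
The stemmed output above differs from the input only in letter case because `PorterStemmer.stem` expects a single word, not a whole sentence. A short sketch on individual words; the word list is made up for illustration and is not taken from the text above:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative words that share one root.
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [ps.stem(w) for w in words]
print(stems)
```

All five forms reduce to the same stem, which is the kind of merging that stemming is meant to achieve.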


# Lemmatization

When you lemmatize your corpus, you reduce each word to its "lemma," its dictionary base form. Lemmatization transforms a word so that forms with a similar "sense" or "meaning" are combined. Adjectives and adverbs, for example, are useful to lemmatize because related forms are similar and frequent enough to be considered "noise" in the text, yet they do not always share a word root. The example from the lesson is "good" and "better."

As the tutorial points out, stemming doesn't work here because these words don't share a root; lemmatization instead requires looking words up in a lexicon, or dictionary. The word "dictionary" will be used to mean several things as we proceed, so check that you're clear on which meaning is intended.

In [14]:
# Lexicon Normalization
# performing stemming and Lemmatization

from nltk.stem.wordnet import WordNetLemmatizer
lem=WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem=PorterStemmer()

word="flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))

Lemmatized Word: fly
Stemmed Word: fli
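
To make the difference concrete, here is a toy sketch of the lookup idea behind lemmatization. The `toy_lexicon` mapping and the `toy_lemmatize` function are hypothetical and hand-written for this example; a real lemmatizer such as `WordNetLemmatizer` consults the much larger WordNet lexicon and takes the part of speech into account.

```python
# Hypothetical hand-made lexicon: surface form -> lemma.
toy_lexicon = {"better": "good", "best": "good", "flying": "fly", "flew": "fly"}

def toy_lemmatize(word):
    """Return the lemma from the toy lexicon, or the word itself if it is unknown."""
    return toy_lexicon.get(word.lower(), word)

for w in ["better", "flew", "cardboard"]:
    print(w, "->", toy_lemmatize(w))
```

No suffix-stripping rule maps "better" to "good" or "flew" to "fly", which is why a lookup is needed for such forms.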


# Part-of-Speech (POS) tagging 
When you tag a corpus for part of speech, you're looking at the grammatical construction of the text. You will use a _dictionary_ to assign each word a _tag_ with its part-of-speech label: noun, pronoun, adjective, verb, adverb, and so on. Part of speech depends on context, so the same word can receive different tags in different sentences.

In [15]:
sent = "Albert Einstein was born in Ulm, Germany in 1879."

In [16]:
tokens=nltk.word_tokenize(sent)
print(tokens)

['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']


In [17]:
nltk.pos_tag(tokens)

[('Albert', 'NNP'),
 ('Einstein', 'NNP'),
 ('was', 'VBD'),
 ('born', 'VBN'),
 ('in', 'IN'),
 ('Ulm', 'NNP'),
 (',', ','),
 ('Germany', 'NNP'),
 ('in', 'IN'),
 ('1879', 'CD'),
 ('.', '.')]
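
The tags come from the Penn Treebank tag set: `NNP` is a singular proper noun, `VBD` a past-tense verb, `VBN` a past participle, `IN` a preposition or subordinating conjunction, and `CD` a cardinal number. As a sketch of how tagged output is typically used, the snippet below pulls out the proper nouns. The tagged list is hard-coded from the output above so that it runs without the tagger model.

```python
# (word, tag) pairs copied from the nltk.pos_tag output above.
tagged = [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'),
          ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'),
          ('in', 'IN'), ('1879', 'CD'), ('.', '.')]

# Keep only the words tagged as singular proper nouns.
proper_nouns = [word for word, tag in tagged if tag == 'NNP']
print(proper_nouns)
```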

# Sentiment Analysis
There is a great deal of interest in assessing the "feelings" associated with particular vocabularies. We'll talk in much greater depth about sentiment analysis as we go, but for now we're going to focus on the kind of "cleaning" and preparation a text requires, and on the descriptive labeling needed to make this analysis work. What is worth focusing on now is which _type_ of sentiment analysis you (or the experiment you're reading about) are interested in.
* Lexicon-based: counts the number of positive and negative words in a text. The aggregate of those counts is taken as the "sentiment" of the text.
* Machine-learning-based: develops a classification model from a pre-defined set of labeled examples and then classifies new text based on its similarities to and differences from them. The labels are typically positive, negative, and neutral.
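
A minimal sketch of the lexicon-based approach, assuming tiny hand-written word lists. `positive_words`, `negative_words`, and `lexicon_score` are invented for illustration; real analyses use curated sentiment lexicons.

```python
# Hypothetical mini lexicons for illustration only.
positive_words = {"great", "awesome", "good"}
negative_words = {"bad", "awful", "terrible"}

def lexicon_score(tokens):
    """Positive-word count minus negative-word count.

    A score above 0 reads as positive, below 0 as negative, 0 as neutral.
    """
    score = 0
    for token in tokens:
        word = token.lower()
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
    return score

tokens = ['The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']
print(lexicon_score(tokens))
```

A plain count like this ignores negation ("not great"), which is one of the limitations of the lexicon-based approach.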

# Text Classification

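Read the data into a pandas DataFrame. The file `train.tsv` is tab-separated and must be in the notebook's working directory; otherwise `read_csv` raises the `FileNotFoundError` shown below.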

In [22]:
#import pandas
import pandas as pd

data=pd.read_csv('train.tsv', sep='\t')

data.head()

FileNotFoundError: [Errno 2] File b'train.tsv' does not exist: b'train.tsv'

In [56]:
data.info()

NameError: name 'data' is not defined

In [57]:
data.Sentiment.value_counts()

NameError: name 'data' is not defined
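
The errors above all trace back to the missing `train.tsv`: because `read_csv` failed, `data` was never created, so the later cells raise `NameError`. A hedged sketch of the same steps on a tiny in-memory sample follows. The rows and the `Phrase` column are invented for illustration; only the `Sentiment` column name comes from the code above.

```python
import io
import pandas as pd

# Made-up tab-separated sample standing in for train.tsv.
sample_tsv = "Phrase\tSentiment\nA fine film\t3\nDull and slow\t1\nQuite good\t3\n"

# Same call as above, but reading from an in-memory buffer instead of a file.
data = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(data.head())

# Number of rows per sentiment label.
counts = data.Sentiment.value_counts()
print(counts)
```

With the real file in place, the same `head`, `info`, and `value_counts` calls give a quick check of the columns and of how balanced the labels are.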