Hello everyone. My name is Sean Ru. I am a third-year CS major here at the Data Visualization Lab.

For today's workshop, we will be using the NLTK library to walk you through some basic steps of a text analytics project. To be more specific, we will be talking about word/term frequency in texts. For those of you who don't know, NLTK (the Natural Language Toolkit) is a Python library used to work with human language data.

For those of you who have heard of text analytics before, you might also think of text mining and text analysis, which are not the same as text analytics.

According to the MonkeyLearn blog, “Text analytics is the automated process of translating large volumes of unstructured text into quantitative data to uncover insights, trends, and patterns. Combined with data visualization tools, this technique enables companies to understand the story behind the numbers and make better decisions.”

On the other hand, text mining and text analysis are the same thing: the process of transforming unstructured text into structured data for easy analysis by using Natural Language Processing (NLP), which is the process of training machines/programs to understand and process human language.

So you might be thinking, well, that sounds really similar to text analytics. The difference is that text mining delivers qualitative results, while text analytics delivers quantitative results.


NLTK is one of the most popular tools to process human language data. 

Some basic steps of text analysis we are going to demonstrate include:

       -tokenize text
       -clean punctuation
       -remove stop words
       -stem words
       -tag words
       -vocabulary diversity
       -word frequency distribution

We actually are not going to go over sentiment analysis, sorry about that; sentiment analysis is actually text mining/analysis, so I'll have to fix that. So, let's start with some of the major libraries we will need to import the files.

In [None]:
from pathlib import Path #provides an object api for working with files and directories
import pandas as pd #library used for data science and machine learning
import os #provides functions for interacting with operating systems
import glob #used to return all file paths that match a specific pattern
import sys #provides functions and variables used to manipulate different parts of the Python runtime environment

### Import files

Make sure to upload the Sample_data folder to Google Colab through the Files panel.

In order to import a folder of files, we first use the os.chdir function to navigate to the right directory.

Then we use the glob.glob function to list every file that matches a pattern, and loop through them.

In [None]:
my_dir = "Sample_data"
os.chdir(my_dir)   #change the current working directory to specified path. 

In [None]:
reviewList = []
for file_path in glob.glob("*.txt"):   #glob.glob returns a list of pathnames; here it matches every .txt file in the sample folder
    #read each file as plain text; pd.read_csv would treat the first line as column headers and split the review at every comma
    #assumes the files are UTF-8 encoded
    content = Path(file_path).read_text(encoding="utf-8")
    reviewList.append(content) #add the text of each file in the sample data to this list
print(reviewList) #see the list of strings from all the .txt files
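
If you would rather not change the working directory, here is a minimal sketch that does the same kind of file collection with pathlib only. The helper name `read_text_files` and the throwaway files are made up for this illustration; they are not part of the workshop data.

```python
import tempfile
from pathlib import Path


def read_text_files(folder):
    """Return the contents of every .txt file in `folder`, sorted by file name."""
    folder = Path(folder)
    # Path.glob yields matching paths; sorting makes the order reproducible.
    return [p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.txt"))]


# Build a throwaway folder with two made-up reviews so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Great movie, loved it.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Too long and boring.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("not a review", encoding="utf-8")
    demo_reviews = read_text_files(tmp)

print(demo_reviews)
```

Because the folder is passed in as an argument, re-running the cell does not depend on where the notebook currently "is", which is a common source of errors after os.chdir.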

Convert the review list into a huge string.

In [None]:
str1 = " " #separator placed between reviews when they are joined into 1 huge string, we want it as a bag of words
data = str1.join(reviewList) #combines all the strings, data is a single string
#with one string we don't need more loops to apply the same function to each separate string in the reviewList
data = data.replace("<br />"," ") #swaps HTML line-break tags for spaces so neighboring words don't fuse together
print (data)

# tokens = nltk.word_tokenize(str(words)) #convert words from list to string and tokenize them
# print (tokens)

### Remove punctuation and stop words
Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.

Stop words are words that are so commonly used that they carry very little useful information.

http://www.nltk.org/nltk_data/
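
To make the two ideas concrete before we bring in NLTK, here is a rough sketch using only the standard library. The regular expression and the tiny `TOY_STOP_WORDS` set are assumptions made for this illustration; NLTK's tokenizer and English stop word list are far more thorough.

```python
import re

# A tiny, made-up stop list for illustration; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "a", "is", "and", "it", "of"}


def simple_tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def drop_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]


sample = "The plot is thin, and the ending of it is a mess."
toy_tokens = simple_tokenize(sample)
toy_filtered = drop_stop_words(toy_tokens)
print(toy_tokens)
print(toy_filtered)
```

Notice how only the content-bearing words survive the filter, which is the whole point of removing stop words before counting frequencies.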

In [None]:
import nltk
# nltk.download_shell() for mac users

In [None]:
# import re, pprint
from nltk.corpus import stopwords #stopwords are words that can be safely ignored, they don't add much meaning to a sentence outside of grammar
from nltk.tokenize import word_tokenize #word_tokenize splits a string into individual words called tokens

In [None]:
nltk.download('punkt') #tokenizer models; newer NLTK releases may ask for nltk.download('punkt_tab') instead

punctuations = "$?:!.,;/\\&*)--(...-``''" #punctuation tokens we want to drop

#tokens = word_tokenize(data) #tokenize the string without replacing periods first
tokens = word_tokenize(data.replace("."," ")) #replace the periods with spaces, then split the string into tokens
words = [word.lower() for word in tokens if word not in punctuations] #lowercase every token that isn't punctuation and add it to words
print(sorted(words))
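
One thing worth knowing about the filter above: because `punctuations` is a string, `word in punctuations` is a substring test, not a membership test. The sketch below shows a couple of surprising matches and one possible alternative; the helper `is_punctuation` is a hypothetical name for this illustration, and it only covers ASCII punctuation.

```python
import string

punctuations = "$?:!.,;/\\&*)--(...-``''"

# `token in some_string` is a substring test, so these cases behave oddly:
print("" in punctuations)      # True: the empty string is a substring of every string
print(")-" in punctuations)    # True: adjacent characters in the string also match
print("%" in punctuations)     # False: this symbol was never listed


def is_punctuation(token):
    """True when the token is non-empty and made only of ASCII punctuation marks."""
    return bool(token) and all(ch in string.punctuation for ch in token)


demo_tokens = ["great", "...", "``", "%", "do", "n't", "--"]
kept = [t for t in demo_tokens if not is_punctuation(t)]
print(kept)
```

Checking character by character catches symbols that were never typed into the string, while still keeping tokens such as "n't" that mix letters and punctuation.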

In [None]:
nltk.download('stopwords')

stop_words = set(stopwords.words('english')) #tell it we want english

## Add extra stop words after viewing the results
# stop_words.add("m")
# stop_words.add("'s")
# stop_words.remove("yourself")
# print (sorted(stop_words))

filtered_words = [word for word in words if not word in stop_words] #if not a stop word, add to filtered_words
print(sorted(filtered_words)) #sort it alphabetically

### Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
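
To see the difference in miniature, here is a toy sketch. The suffix list and the lookup table are invented for this illustration and are nowhere near what the Porter stemmer or WordNet actually do; the point is only that one approach chops endings and the other looks words up.

```python
# Toy suffix rules, checked in order; real stemmers use many more rules.
TOY_SUFFIXES = ("ing", "ed", "es", "s")

# Toy lookup table standing in for a real vocabulary such as WordNet.
TOY_LEMMAS = {"mice": "mouse", "went": "go", "better": "good", "studies": "study"}


def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is in the lookup table."""
    return TOY_LEMMAS.get(word, word)


for w in ["studies", "mice", "went", "walked"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer turns "studies" into the non-word "studi" and cannot do anything with "mice", while the lookup returns real words but only for entries it knows about.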

In [None]:
from nltk.stem import PorterStemmer
#Porter stemming removes the commoner morphological and inflexional endings from words in English, liked,likes->like
#it simplifies words to a common base form
from nltk.stem import SnowballStemmer
#Snowball (also called Porter2) is an improved version of the Porter stemmer

stemmer = PorterStemmer() #kept for comparison, not used below
stemmer2 = SnowballStemmer("english") #tell SnowballStemmer the language is English

stem_words = [stemmer2.stem(word) for word in filtered_words]
print (sorted(stem_words))

In [None]:
nltk.download('wordnet') #lexical database for English (nouns, adjectives, adverbs, verbs)
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer 
#the process of grouping together the different inflected forms of a word so they can be analyzed as a single item
wnl = WordNetLemmatizer()
print(wnl.lemmatize("cats"))
print(wnl.lemmatize("giggling"))
print(wnl.lemmatize("giggling", "v"))

lemm_words = [wnl.lemmatize(word) for word in filtered_words]
print(sorted(lemm_words))


### Part-of-Speech Tagging

    CC    coordinating conjunction
    CD    cardinal number
    DT    determiner
    EX    existential there (like: "there is" ... think of it like "there exists")
    FW    foreign word
    IN    preposition/subordinating conjunction
    JJ    adjective 'big'
    JJR   adjective, comparative 'bigger'
    JJS   adjective, superlative 'biggest'
    LS    list marker '1)'
    MD    modal 'could', 'will'
    NN    noun, singular 'desk'
    NNS   noun, plural 'desks'
    NNP   proper noun, singular 'Harrison'
    NNPS  proper noun, plural 'Americans'
    PDT   predeterminer 'all the kids'
    POS   possessive ending "parent's"
    PRP   personal pronoun 'I', 'he', 'she'
    PRP$  possessive pronoun 'my', 'his', 'hers'
    RB    adverb 'very', 'silently'
    RBR   adverb, comparative 'better'
    RBS   adverb, superlative 'best'
    RP    particle 'give up'
    TO    to, as in go 'to' the store
    UH    interjection 'errrrrrrrm'
    VB    verb, base form 'take'
    VBD   verb, past tense 'took'
    VBG   verb, gerund/present participle 'taking'
    VBN   verb, past participle 'taken'
    VBP   verb, non-3rd person singular present 'take'
    VBZ   verb, 3rd person singular present 'takes'
    WDT   wh-determiner 'which'
    WP    wh-pronoun 'who', 'what'
    WP$   possessive wh-pronoun 'whose'
    WRB   wh-adverb 'where', 'when'


#### 1. Use tags for lemmatization

pos_tag gets the tag for each word. The result comes in the form of a list of tuples: [(word1, tag1), (word2, tag2), (word3, tag3)].

Use indexing to drill down: the first [0] gets the first tuple, the [1] gets its tag, and the final [0] grabs the first letter of that tag.
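
Here is the indexing chain spelled out one step at a time. The list `tagged` is typed by hand to mimic the shape of pos_tag output; it is not produced by the tagger, so the tags are only illustrative.

```python
# Hand-written stand-in for the list of (word, tag) tuples that pos_tag returns.
tagged = [("meaningful", "JJ"), ("movies", "NNS")]

first_pair = tagged[0]        # the first (word, tag) tuple
first_tag = first_pair[1]     # the tag inside that tuple
first_letter = first_tag[0]   # the first character of the tag

print(first_pair, first_tag, first_letter)

# Tuple unpacking expresses the same thing without chained indexes.
word, tag = tagged[0]
print(word, tag[0])
```

Both routes give the same letter; unpacking is often easier to read once the chain gets longer than two indexes.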

In [None]:
#tuples are ordered collections of objects
from nltk.corpus import wordnet  
nltk.download('averaged_perceptron_tagger') #pretrained model used for tagging words with their parts of speech (POS); newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead

#WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs 
#are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. 
#reminder: we removed stop words, so determiner tags for words like 'a', 'an', and 'the' will show up less often
#you'll see that the count for determiners, DT, isn't 0, and that's because either our code didn't actually remove
#all the stop words, or the pretrained model may have mistagged some words. that's fine because this is an intro
#workshop and even though there is some error, we still largely get the job done

#for example, check the first letter of the tag the model gives a single word
print(nltk.pos_tag(["meaningful"])[0][1][0])
#pretrained models can have errors

#We are tagging each word so we can lemmatize it to common base form
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

#print (get_wordnet_pos("facing"))
#print (get_wordnet_pos("kindly"))

lemm_words = [wnl.lemmatize(w, get_wordnet_pos(w)) for w in filtered_words]
print (sorted(lemm_words)[0:100])

#### 2. Tagging our data

nltk.pos_tag() returns a list of (word, tag) tuples. The key here is to map NLTK's POS tags to the format the WordNet lemmatizer accepts. The get_wordnet_pos() function defined above does this mapping job.

Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
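
The mapping idea can be sketched without NLTK at all. In the block below, the function name `penn_to_wordnet` is hypothetical, and the single-letter codes are written out as plain strings ("a", "n", "v", "r") to stand in for the wordnet.ADJ, wordnet.NOUN, wordnet.VERB, and wordnet.ADV constants.

```python
# Single-letter part-of-speech codes: adjective, noun, verb, adverb.
WORDNET_CODES = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(penn_tag, default="n"):
    """Map a tag such as 'VBD' to a one-letter code, falling back to noun."""
    # Only the first letter of the tag matters; slicing keeps empty strings safe.
    return WORDNET_CODES.get(penn_tag[:1].upper(), default)


for penn_tag in ["JJR", "NNS", "VBD", "RB", "DT", ""]:
    print(repr(penn_tag), "->", penn_to_wordnet(penn_tag))
```

Any tag that is not an adjective, noun, verb, or adverb (a determiner, for example) falls back to the noun code, which mirrors the `tag_dict.get(tag, wordnet.NOUN)` default used in this notebook.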

In [None]:
text = word_tokenize("faking a review for tagging purpose")
nltk.pos_tag(text)

In [None]:
from collections import Counter
#count hashable objects

tags = nltk.pos_tag(lemm_words)
#print (tags[:5])

tag_counts = Counter(tag for word,tag in tags)
print (tag_counts)
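
If you want to try the counting step on its own, here is a small sketch with hand-written (word, tag) pairs. The pairs in `demo_tags` are made up for illustration and were not produced by the tagger.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for pos_tag output.
demo_tags = [("movie", "NN"), ("great", "JJ"), ("scene", "NN"),
             ("watch", "VB"), ("character", "NN"), ("funny", "JJ")]

# Count only the tag of each pair; the word itself is ignored.
demo_tag_counts = Counter(tag for _, tag in demo_tags)
print(demo_tag_counts.most_common())

# Share of each tag among all tagged words.
total = sum(demo_tag_counts.values())
tag_share = {tag: n / total for tag, n in demo_tag_counts.items()}
print(tag_share)
```

Turning the raw counts into shares makes it easier to compare collections of different sizes.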


### Counting Words

#### 1. Check Vocabulary Diversity
set() creates a distinct collection of the iterable elements (all words here).

In [None]:
print (len(set(lemm_words)))  # number of distinct words
print (len(lemm_words))    # number of total words
print ("The vocabulary diversity of the reviews is: "+ str(len(set(lemm_words))/len(lemm_words)))
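
The same ratio can be wrapped in a small function so it is reusable and does not crash on an empty list. The name `vocabulary_diversity` and the sample words are assumptions for this sketch.

```python
def vocabulary_diversity(words):
    """Distinct words divided by total words; 0.0 for an empty list."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


demo = ["good", "movie", "good", "plot", "bad", "movie"]
print(vocabulary_diversity(demo))
print(vocabulary_diversity([]))
```

Keep in mind that this ratio tends to drop as a text gets longer, so it is most meaningful when comparing collections of similar size.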

#### 2. Count total words and unique words

In [None]:
# create a dictionary to store unique words and their counts.
count = {}
for w in lemm_words:
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
    
count = {k: v for k, v in sorted(count.items(), key=lambda item: item[1], reverse=True)} 
# count.items() yields (key, value) pairs
# the word is the key and its count is the value
# we are sorting from greatest to smallest count
# "key=lambda item: item[1]" tells sorted() to order the pairs by value.
# This is an example of a lambda function, which is a function without a name.
# By default the sort is ascending; "reverse=True" flips the order to descending.

print (count)
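
For comparison, the standard library's collections.Counter does the same tally in one line. This is a sketch on made-up words, showing that the manual loop and Counter agree.

```python
from collections import Counter

demo_words = ["movie", "plot", "movie", "scene", "plot", "movie"]

# Manual tally with a plain dictionary.
manual = {}
for w in demo_words:
    manual[w] = manual.get(w, 0) + 1
manual_sorted = sorted(manual.items(), key=lambda item: item[1], reverse=True)

# Counter does the tally, and most_common() returns pairs sorted by count.
counted = Counter(demo_words).most_common()

print(manual_sorted)
print(counted)
```

Writing the loop by hand is still useful for learning what is going on, but Counter is usually the shorter choice in practice.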

#### 3. NLTK's Frequency Distribution Functions

    fdist = FreqDist(samples)      create a frequency distribution containing the given samples
    fdist[sample] += 1             increment the count for this sample
    fdist['monstrous']             count of the number of times a given sample occurred
    fdist.freq('monstrous')        frequency of a given sample
    fdist.N()                      total number of samples
    fdist.most_common(n)           the n most common samples and their frequencies
    for sample in fdist:           iterate over the samples
    fdist.max()                    sample with the greatest count
    fdist.tabulate()               tabulate the frequency distribution
    fdist.plot()                   graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 |= fdist2               update fdist1 with counts from fdist2
    fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
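
FreqDist is built on top of Python's collections.Counter, so several rows of the table can be mimicked with a plain Counter. The sketch below is only an approximation of what those methods report, on made-up samples; it does not call NLTK.

```python
from collections import Counter

samples = ["movie", "plot", "movie", "scene", "plot", "movie"]
counts = Counter(samples)

n_total = sum(counts.values())               # roughly what fdist.N() reports
count_movie = counts["movie"]                # like fdist['movie']
freq_movie = counts["movie"] / n_total       # like fdist.freq('movie')
top_two = counts.most_common(2)              # like fdist.most_common(2)
most_frequent = max(counts, key=counts.get)  # like fdist.max()

counts["scene"] += 1                         # like fdist['scene'] += 1
print(n_total, count_movie, freq_movie, top_two, most_frequent, counts["scene"])
```

The main thing to take away is that freq() is just the count divided by the total, so frequencies across all samples add up to 1.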

In [None]:
from nltk import FreqDist #frequency distribution

freq_words=FreqDist(lemm_words)
freq_words.most_common(30)

In [None]:
! pip install matplotlib #creating static, animated, and interactive visualizations

In [None]:
import matplotlib.pyplot as plt #MATLAB-style plotting interface
#plt.figure(figsize=(12, 5))  #(width, height) in inches
#plt.title("Cumulative Frequency Distribution")
plot1 = FreqDist(lemm_words).plot(30, cumulative=True, color="black") #cumulative counts of the 30 most common words

plt.figure(figsize=(12, 5))  
plt.title("Non-cumulative Frequency Distribution")
plot2 = FreqDist(lemm_words).plot(30, cumulative=False, color="purple") #individual counts of the 30 most common words
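
If the difference between the two plots is not obvious, this sketch computes the numbers behind them by hand on a few made-up words. It uses itertools.accumulate for the running total and does not draw anything.

```python
from collections import Counter
from itertools import accumulate

demo_words = ["movie", "plot", "movie", "scene", "plot", "movie"]
top = Counter(demo_words).most_common(3)

labels = [word for word, _ in top]
plain_counts = [n for _, n in top]               # heights in the non-cumulative plot
cumulative_counts = list(accumulate(plain_counts))  # running totals in the cumulative plot

for label, plain, running in zip(labels, plain_counts, cumulative_counts):
    print(label, plain, running)
```

The cumulative line always rises, and its last value tells you how many of the total words are covered by the top words you plotted.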

### Dispersion Plot

Show the location of words in the collection.

In [None]:
from nltk.draw.dispersion import dispersion_plot 
#allows for visualization of the lexical dispersion of words in a corpus, which is a collection of texts
dispersion_plot(lemm_words, ['movie', 'scene','character'])
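
Under the hood, a dispersion plot needs to know where each target word occurs in the token list. Here is a minimal sketch of that bookkeeping; the helper `word_offsets` and the sample tokens are made up for this illustration and are not NLTK's implementation.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs in tokens."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


demo_tokens = ["movie", "great", "scene", "movie", "character", "scene", "movie"]
demo_offsets = word_offsets(demo_tokens, ["movie", "scene", "plot"])
print(demo_offsets)
```

Each position would become one tick mark on the plot, and a word that never appears, like "plot" here, simply gets an empty row.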

References:

        https://www.nltk.org/book/ch01.html
        https://www.nltk.org/

