<a href="https://colab.research.google.com/github/GTLibraryDataVisualization/Introduction-to-Text-Analytics-with-Python/blob/master/Final_Text_Analytics_with_Python_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this workshop, we will use the NLTK (Natural Language Toolkit) library to walk through the basic steps of a text analytics project. NLTK is a Python library for working with human language data.

"Text analytics is the automated process of translating large volumes of unstructured text into quantitative data to uncover insights, trends, and patterns. Combined with data visualization tools, this technique enables companies to understand the story behind the numbers and make better decisions." -monkeylearn.com

NLTK is one of the most popular tools for processing human language data.

The basic steps of text analysis we will demonstrate are:

- tokenize text
- remove punctuation
- remove stop words
- stem and lemmatize words
- tag words with their part of speech
- measure vocabulary diversity
- compute word frequency distributions

In [None]:
# Imports

from pathlib import Path  # object-oriented API for working with files and directories
import os  # functions for interacting with the operating system
import glob  # returns all file paths that match a specific pattern
import sys  # functions and variables for interacting with the Python runtime environment

### Import files

Make sure to upload the sample_data files to Google Colab using the Files panel.

To import a folder of files, we first use the os.chdir function to navigate to the right directory.

Then we use the glob.glob function to iterate through all of the files.

In [None]:
my_dir = "sample_data"
os.chdir(my_dir)   # change the current working directory to specified path. 

In [None]:
os.getcwd() # verify that we are in the right directory

In [None]:
reviewList = []
for file in glob.glob("*.txt"):   # glob.glob returns a list of pathnames; here, every file with a .txt extension
    with open(file, "r") as f:
        content = f.readlines()
        for line in content:
            reviewList.append(line)  # add each line of each file in sample_data to the list
print(reviewList)  # see the list of strings
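
As an alternative to changing the working directory, the same loading step can be sketched with `pathlib`, which the notebook already imports. This is a minimal sketch, not the workshop's method: the helper name `read_lines` and the demo file names are hypothetical, and the demo writes its own throwaway files so it runs anywhere.

```python
import tempfile
from pathlib import Path

def read_lines(folder, pattern="*.txt"):
    """Return every line from every file in `folder` that matches `pattern`."""
    lines = []
    for path in sorted(Path(folder).glob(pattern)):  # sorted for a stable order
        with path.open("r", encoding="utf-8") as f:
            lines.extend(f.readlines())
    return lines

# Demo on a temporary folder holding two hypothetical review files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "review1.txt").write_text("Great movie.\n", encoding="utf-8")
    Path(tmp, "review2.txt").write_text("Weak plot.<br />Good cast.\n", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored\n", encoding="utf-8")  # not a .txt file
    demo_lines = read_lines(tmp)

print(demo_lines)
```

Because the folder is passed in explicitly, later cells do not depend on which directory happens to be current.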

We want a bag of words, so we convert the review list into a huge string.

In [None]:
str1 = " "  # separator placed between the strings in reviewList when they are joined into one huge string
data = str1.join(reviewList)

data = data.replace("<br />", "")  # delete any HTML line breaks...
data = data.replace("\n", "")  # ...and \n (newline characters)
data = data.replace(".", " ")  # remove sentence structure by replacing periods with spaces

data = data.lower() # Normalize any capitalization

print (data)

### Remove punctuation and stop words
Tokenization is the process of dividing a large quantity of text into smaller parts called tokens.

Stop words are words that are so commonly used that they carry very little useful information.

We will use NLTK and its downloadable data packages to help us:
http://www.nltk.org/nltk_data/
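
To make the two ideas concrete before bringing in NLTK, here is a minimal sketch using only the standard library. The regular expression is a crude stand-in for a real tokenizer, and `demo_stop_words` is a tiny hand-picked list assumed for this demo; NLTK's tokenizer and stop word list behave differently (for example, `word_tokenize` splits contractions such as "wasn't" into two tokens).

```python
import re
from string import punctuation

text = "the movie wasn't great, but the acting was superb!"

# Crude tokenizer: runs of letters/apostrophes are words; any other
# non-space character becomes its own token
tokens = re.findall(r"[a-z']+|[^\sa-z']", text)

# Drop tokens that are a single punctuation character
words = [t for t in tokens if t not in punctuation]

# Tiny hand-picked stop word list (an assumption for this demo)
demo_stop_words = {"the", "but", "was", "wasn't"}
filtered = [w for w in words if w not in demo_stop_words]

print(tokens)
print(filtered)
```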

In [None]:
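import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
# nltk.download_shell()  # text-based downloader, for Mac users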

In [None]:
from nltk.corpus import stopwords # stopwords are words that can be safely ignored, they don't add much meaning to a sentence
from string import punctuation # contains all punctuation characters
from nltk.tokenize import word_tokenize # splits a string into individual words called tokens

In [None]:
tokens = word_tokenize(data)  # converts the string into a list of tokens (words)
words = [word for word in tokens if word not in punctuation]  # drop tokens that are punctuation characters
print(sorted(words))

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

## Add extra stop words after viewing the results
# stop_words.add("'m")
# stop_words.add("'s")
# stop_words.add("''")
# stop_words.add("'ll")
# stop_words.add("``")
# stop_words.remove("yourself")
# print (sorted(stop_words))

filtered_words = [word for word in words if word not in stop_words]
print(sorted(filtered_words))

### Stemming and Lemmatization

Stemming and lemmatization both aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
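
The contrast can be sketched with two toy functions. Neither is what NLTK does internally: `toy_stem` is a deliberately crude suffix chopper (not the Porter algorithm), and `toy_lemmas` is a hypothetical four-word vocabulary standing in for a real lexical database.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Chop the first matching suffix; a crude heuristic, not the Porter algorithm."""
    for suffix in suffixes:
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-vocabulary standing in for a real lexical database
toy_lemmas = {"studies": "study", "studying": "study", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; leave unknown words unchanged."""
    return toy_lemmas.get(word, word)

for w in ["studies", "studying", "mice", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer turns "studies" into the non-word "studi" and cannot touch "mice", while the vocabulary lookup returns a dictionary form for every word it knows.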

In [None]:
from nltk.stem import PorterStemmer
# removes the more common morphological and inflectional endings from English words, e.g. liked, likes -> like
from nltk.stem import SnowballStemmer
# an improved version of the Porter stemmer

stemmer = PorterStemmer()
stemmer2 = SnowballStemmer("english")  # tell SnowballStemmer the language is English

stem_words = [stemmer2.stem(word) for word in filtered_words]
print(sorted(stem_words))

In [None]:
nltk.download('wordnet') #lexical database for English (nouns, adjectives, adverbs, verbs)

from nltk.stem import WordNetLemmatizer 
#the process of grouping together the different inflected forms of a word so they can be analyzed as a single item
wnl = WordNetLemmatizer()
print(wnl.lemmatize("cats"))
print(wnl.lemmatize("giggling"))
print(wnl.lemmatize("giggling", "v"))

lemm_words = [wnl.lemmatize(word) for word in filtered_words]
print(sorted(lemm_words))

### Part-of-Speech Tagging
Words in a sentence can be categorized by their syntactic function, known as the part of speech (POS). Take a look at the table below to see some examples of POS tags. We can use the tag to help our lemmatizer return a word to its base form.

Tag | Definition | Example
--- | --- | ---
CC | coordinating conjunction | and, but, for, etc.
CD | cardinal number | 0, 10, 523
DT | determiner | the, a, that
EX | existential there | "there is" ... think of it like "there exists"
FW | foreign word | 
IN | preposition/subordinating conjunction | above, toward, on, etc.
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular | desk
NNS | noun, plural | desks
NNP | proper noun, singular | Harrison
NNPS | proper noun, plural | Americans
PDT | predeterminer | "all" in "all the kids"
POS | possessive ending | parent's
PRP | personal pronoun | I, he, she
PRP$ | possessive pronoun | my, his, her
RB | adverb | very, silently
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | "up" in "give up"
TO | to (infinitive marker or preposition) | to go, to the store
UH | interjection | errrrrrrrm
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, present tense non-3rd person singular | take
VBZ | verb, present tense 3rd person singular | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when


#### 1. Tagging our data

nltk.pos_tag() returns a list of (word, tag) tuples. The key here is to map NLTK's POS tags to the format the WordNet lemmatizer accepts. The get_wordnet_pos() function defined below does this mapping.

Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

In [None]:
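nltk.download('averaged_perceptron_tagger')  # pretrained tagger used by nltk.pos_tag; recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead
text = word_tokenize("faking a review for tagging purpose")
nltk.pos_tag(text)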

In [None]:
from collections import Counter
#count hashable objects

tags = nltk.pos_tag(filtered_words)
# print (tags[:5])

tag_counts = Counter(tag for word,tag in tags)
print (tag_counts)


#### 2. Use tags for lemmatization

pos_tag returns the tag for each word as a list of tuples: [(word1, tag1), (word2, tag2), (word3, tag3)].

Use indexing to drill down: the first [0] selects the first tuple, the [1] selects its tag, and the final [0] grabs the first letter of that tag.
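
The indexing can be tried on a hand-written list before running the tagger. The sketch below assumes a sample list in the same shape that `nltk.pos_tag` returns; the tags are typed in by hand, not produced by NLTK. The single letters in `tag_to_wordnet` are the values held by WordNet's POS constants (`wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADV`).

```python
# Hand-written sample in the same shape that nltk.pos_tag returns
tagged = [("kindly", "RB"), ("facing", "VBG"), ("movie", "NN"), ("bigger", "JJR")]

first_pair = tagged[0]          # the first (word, tag) tuple
first_tag = tagged[0][1]        # the tag inside that tuple
first_letter = tagged[0][1][0]  # the first letter of that tag

print(first_pair, first_tag, first_letter)

# First letter of the Penn tag -> WordNet POS letter
tag_to_wordnet = {"J": "a", "N": "n", "V": "v", "R": "r"}

mapped = [(word, tag_to_wordnet.get(tag[0], "n")) for word, tag in tagged]  # default to noun
print(mapped)
```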

In [None]:
# tuples are ordered collections of objects
from nltk.corpus import wordnet
# nltk.download('averaged_perceptron_tagger')  # needed by nltk.pos_tag if not already downloaded

# WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs
# are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

print(nltk.pos_tag(["meaningful"])[0][1][0])
# If the letter printed is not "J" (adjective), the tag is wrong: the pretrained tagger
# is unreliable on a single word that has no sentence context

# We tag each word so we can lemmatize it to its common base form
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

print (get_wordnet_pos("facing"))
print (get_wordnet_pos("kindly"))

lemm_words = [wnl.lemmatize(w, get_wordnet_pos(w)) for w in filtered_words]
print (sorted(lemm_words)[:400])

### Counting Words
#### 1. Check Vocabulary Diversity
set() creates a distinct collection of the iterable elements (all words here).

In [None]:
distinct_words = len(set(lemm_words))
total_words = len(lemm_words)
vocab_diversity = distinct_words / total_words
print (distinct_words)  # number of distinct words
print (total_words)    # number of total words
print (f"The vocabulary diversity of the reviews is: {vocab_diversity}")
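
A worked example on a short hand-made list shows the arithmetic. This ratio of distinct words to total words is often called the type-token ratio; `sample_words` is made up for illustration.

```python
sample_words = ["good", "movie", "good", "plot", "bad", "movie", "good", "cast"]

distinct = set(sample_words)  # duplicates collapse: good, movie, plot, bad, cast
diversity = len(distinct) / len(sample_words)

print(len(distinct))      # 5 distinct words
print(len(sample_words))  # 8 words in total
print(diversity)          # 5 / 8 = 0.625
```

A value near 1 means almost every word is new; a value near 0 means the text repeats a small vocabulary.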

#### 2. Count occurrences of each unique word

In [None]:
# create a dictionary to store unique words and their counts
count = {}
for w in lemm_words:
    # equivalent one-liner: count[w] = count.get(w, 0) + 1
    if w in count:
        count[w] += 1
    else:
        count[w] = 1

count = {k: v for k, v in sorted(count.items(), key=lambda item: item[1], reverse=True)}
# each item in count is a (key, value) pair:
# the word is the key and its count is the value
# we sort from the largest count to the smallest

print(count)

# "key=lambda item: item[1]" tells sorted() to order the items by value.
# This is an example of a lambda function, which is a function without a name.
# By default the sort is ascending; "reverse=True" flips the order to descending.
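
The same counting and sorting can also be sketched with `collections.Counter` from the standard library, which the notebook already uses for tag counts. `sample_words` below is a made-up list; in the notebook you would pass `lemm_words` instead.

```python
from collections import Counter

sample_words = ["movie", "good", "movie", "plot", "good", "movie"]

counts = Counter(sample_words)  # builds the word -> count mapping in one step
ranked = counts.most_common()   # (word, count) pairs, largest count first
top_one = counts.most_common(1)

print(ranked)   # [('movie', 3), ('good', 2), ('plot', 1)]
print(top_one)  # [('movie', 3)]
```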

#### 3. NLTK's Frequency Distribution Functions

We can create a frequency distribution by passing our samples (a list of words) to `FreqDist`.

`fdist = FreqDist(samples)   #create a frequency distribution containing the given samples`

Function | Description
--- | ---
`fdist[sample] += 1`	| increment the count for this sample
`fdist['monstrous']`	| count of the number of times a given sample occurred
`fdist.freq('monstrous')`	| frequency of a given sample
`fdist.N()`	| total number of samples
`fdist.most_common(n)`	| the n most common samples and their frequencies
`for sample in fdist:`	| iterate over the samples
`fdist.max()`	| sample with the greatest count
`fdist.tabulate()`	| tabulate the frequency distribution
`fdist.plot()`	| graphical plot of the frequency distribution
`fdist.plot(cumulative=True)`	| cumulative plot of the frequency distribution
`fdist1 \|= fdist2`	| update fdist1 with counts from fdist2
`fdist1 < fdist2`   | test if samples in fdist1 occur less frequently than in fdist2
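
A few of the functions in the table can be tried on a small hand-made list. This sketch assumes NLTK is installed; `samples` is made up for illustration, and no NLTK data downloads are needed for `FreqDist`.

```python
from nltk import FreqDist

samples = ["movie", "good", "movie", "plot", "good", "movie"]
fdist = FreqDist(samples)

movie_count = fdist["movie"]      # number of times "movie" occurred
total = fdist.N()                 # total number of samples
movie_freq = fdist.freq("movie")  # count divided by the total
top_word = fdist.max()            # sample with the greatest count
top_two = fdist.most_common(2)    # the 2 most common samples with their counts

print(movie_count, total, movie_freq, top_word, top_two)

fdist["plot"] += 1                # increment the count for one sample
print(fdist["plot"], fdist.N())
```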

In [None]:
from nltk import FreqDist  # frequency distribution

freq_words = FreqDist(lemm_words)  # initialize the frequency distribution on our sample of lemmatized words
freq_words.most_common(30)

In [None]:
!pip install matplotlib  # library for creating static, animated, and interactive visualizations

In [None]:
import matplotlib.pyplot as plt  # MATLAB-style plotting interface
#plt.figure(figsize=(12, 5))  # (width, height) in inches
#plt.title("Cumulative Frequency Distribution")
plot1 = FreqDist(lemm_words).plot(30, cumulative=True, color="black")  # cumulative counts of the 30 most common words

plt.figure(figsize=(12, 5))
plt.title("Non-cumulative Frequency Distribution")
plot2 = FreqDist(lemm_words).plot(30, cumulative=False, color="purple")

### Dispersion Plot

A dispersion plot shows where each chosen word occurs across the collection.

In [None]:
from nltk.draw.dispersion import dispersion_plot 
#allows for visualization of the lexical dispersion of words in a corpus, which is a collection of texts
dispersion_plot(lemm_words, ['movie', 'scene','character'])
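
The data behind such a plot is just the position of each target word in the word list. The sketch below computes those positions with the standard library; `word_offsets` is a hypothetical helper written for this illustration, not an NLTK function, and `sample_words` is made up.

```python
def word_offsets(words, targets):
    """Map each target word to the list of positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, word in enumerate(words):
        if word in offsets:
            offsets[word].append(position)
    return offsets

sample_words = ["movie", "good", "scene", "movie", "character", "scene", "movie"]
offsets = word_offsets(sample_words, ["movie", "scene", "character"])

print(offsets)  # {'movie': [0, 3, 6], 'scene': [2, 5], 'character': [4]}
```

Each position becomes one tick mark on that word's row in the dispersion plot.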

References:

- https://www.nltk.org/book/ch01.html
- https://www.nltk.org/

