<a href="https://colab.research.google.com/github/GTLibraryDataVisualization/Introduction-to-Text-Analytics-with-Python/blob/master/HandsOn_Text_Analytics_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this workshop, we will use the NLTK library to walk you through some basic steps of a text analytics project. NLTK (the Natural Language Toolkit) is a Python library for working with human language data.

"Text analytics is the automated process of translating large volumes of unstructured text into quantitative data to uncover insights, trends, and patterns. Combined with data visualization tools, this technique enables companies to understand the story behind the numbers and make better decisions." -monkeylearn.com

Some basic steps of text analysis we are going to demonstrate include:

       -tokenize text
       -remove punctuation
       -remove stop words
       -stem words
       -tag words
       -check vocabulary diversity
       -compute word frequency distribution

In [None]:
from pathlib import Path 
import os 
import glob 
import sys 

### Import files

Make sure the sample_data files are available in Google Colab's Files panel before running the cells below.

To read a folder of files, we first use the os.chdir function to navigate to the right directory.

Then we use the glob.glob function to list the files so that we can iterate through them.
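
As a rough sketch of the loading step, the example below builds a temporary folder with two tiny files and reads them into a list. The folder, the file names, and the `.txt` extension are assumptions made so the sketch runs anywhere, and `review_list` plays the role of the notebook's `reviewList`. It joins the folder path into the glob pattern instead of calling os.chdir, which is one possible variation on the approach described above.

```python
import glob
import os
import tempfile

# Build a throwaway folder with two tiny text files so the sketch runs anywhere.
# In the workshop you would point my_dir at "sample_data" instead.
my_dir = tempfile.mkdtemp()
for name, text in [("review1.txt", "Great movie."), ("review2.txt", "Too long.")]:
    with open(os.path.join(my_dir, name), "w", encoding="utf-8") as f:
        f.write(text)

review_list = []
# glob.glob returns every path matching the pattern; sorted() keeps the order stable.
for path in sorted(glob.glob(os.path.join(my_dir, "*.txt"))):
    with open(path, encoding="utf-8") as f:
        review_list.append(f.read())

print(review_list)
```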

In [None]:
my_dir = "sample_data"
os.chdir(my_dir)   

In [None]:
reviewList=[]
#Your Code Here


We want a bag of words, so we convert the review list into a huge string.
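
One way to do the conversion is `str.join`, sketched here with a short made-up list standing in for the reviews read from disk.

```python
# Hypothetical stand-in for the list of reviews read from disk.
review_list = ["Great movie.", "Too long.", "Would watch again."]

# str.join places the separator between items, so no trailing space is left over.
str1 = " ".join(review_list)

print(str1)
```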

In [None]:
str1 = " "
#Your Code Here


### Remove punctuation and stop words
Tokenization is the process of dividing a large body of text into smaller parts called tokens.

Stop words are words that are so commonly used that they carry very little useful information.

We will use NLTK to help us. The data packages it can download are listed at:
http://www.nltk.org/nltk_data/
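
The sketch below shows the overall shape of this cleaning step. To stay runnable without any downloads, it makes two substitutions that are assumptions of this example rather than the notebook's choices: it uses NLTK's regex-based `wordpunct_tokenize` in place of `word_tokenize`, and a tiny hand-written `stop_words` set in place of `stopwords.words('english')`.

```python
from string import punctuation

from nltk.tokenize import wordpunct_tokenize

# Illustrative stand-in for stopwords.words("english"), which needs a download.
stop_words = {"the", "was", "but", "a", "an", "and", "of"}

text = "The plot was thin, but the acting was superb!"

# Split the text into word and punctuation tokens, lowercased for matching.
tokens = wordpunct_tokenize(text.lower())

# Drop tokens made up only of punctuation characters, then drop stop words.
words = [
    tok for tok in tokens
    if not all(ch in punctuation for ch in tok) and tok not in stop_words
]

print(words)
```

Lowercasing before filtering matters because NLTK's English stop word list is lowercase, so "The" would otherwise slip through.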

In [None]:
import nltk
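# nltk.download_shell() for mac users

nltk.download("punkt")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger")
# NLTK 3.9 and later look for these resource names instead
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")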

from nltk.corpus import stopwords 
from string import punctuation
from nltk.tokenize import word_tokenize 

In [None]:
#Your Code Here
punctuation

In [None]:
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

#Your Code Here
stop_words

### Stemming and Lemmatization

Both stemming and lemmatization aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
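
To see the "crude heuristic" in action, here is a small sketch that runs the two NLTK stemmers imported in the next cell over a few words. The word list is made up for illustration. Neither stemmer needs a data download.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs a language name

words = ["running", "flies", "generously", "studies"]

# Stem each word with both algorithms and print the results side by side.
for word in words:
    print(f"{word:12} porter={porter.stem(word):10} snowball={snowball.stem(word)}")
```

Note that a stem does not have to be a dictionary word, and the two algorithms do not always agree. A lemmatizer such as `WordNetLemmatizer` is expected to return a real dictionary form instead.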

In [None]:
from nltk.stem import PorterStemmer 
from nltk.stem import SnowballStemmer

#Your Code Here

In [None]:
nltk.download('wordnet') #lexical database for English (nouns, adjectives, adverbs, verbs)

from nltk.stem import WordNetLemmatizer 

#Your Code Here

### Part-of-Speech Tagging
Words in a sentence can be categorized by their syntactic function, known as the part of speech (POS). Take a look at the table below to see some examples of POS tags. We can use the tag to help our lemmatizer return a word to its base form.

Tag | Definition | Example
--- | --- | ---
CC | coordinating conjunction | and, but, for, etc
CD | cardinal number | 0, 10, 523
DT | determiner | the, a, that
EX | existential there | "there is" ... think of it like "there exists"
FW | foreign word | 
IN | preposition/subordinating conjunction | above, toward, on, etc
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular | desk
NNS | noun, plural | desks
NNP | proper noun, singular | Harrison
NNPS | proper noun, plural | Americans
PDT | predeterminer | all the kids
POS | possessive ending | parent's
PRP | personal pronoun | I, he, she
PRP\$ | possessive pronoun | my, his, her
RB | adverb | very, silently
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | the word "to" | go 'to' the store
UH | interjection | errrrrrrrm
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, present tense non-3rd person singular | take
VBZ | verb, present tense 3rd person singular | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP\$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when


#### 1. Tagging our data

nltk.pos_tag() takes a list of tokens and returns a list of (word, tag) tuples. The key here is to map NLTK's POS tags to the format the WordNet lemmatizer accepts. The get_wordnet_pos() function defined below does this mapping job.

Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
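
Once the words are tagged, `Counter` can summarize how often each tag occurs. The sketch below uses hand-written (word, tag) pairs in the shape that nltk.pos_tag() returns. The tags are illustrative assumptions, not actual tagger output, so the example runs without downloading the tagger.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns.
# The tags are illustrative, not actual tagger output.
tagged = [
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("dog", "NN"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

print(tag_counts.most_common(2))
```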

In [None]:
text = word_tokenize("faking a review for tagging purpose")
#Your Code Here

In [None]:
from collections import Counter

#Your Code Here


#### 2. Use tags for lemmatization

pos_tag tags each word and returns the result as a list of tuples: `[(word1, tag1), (word2, tag2), (word3, tag3)]`.
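
The mapping from Penn Treebank tags to WordNet parts of speech only looks at the first letter of each tag. The sketch below isolates that step on hand-written pairs (illustrative tags, not tagger output). It spells out the single-letter values that NLTK's `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV` constants hold, so it runs without the WordNet download.

```python
# WordNet's part-of-speech constants are single letters; in NLTK,
# wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV hold these values.
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

tag_dict = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}

# Hand-written pairs in pos_tag's output shape (illustrative tags).
tagged = [("quick", "JJ"), ("foxes", "NNS"), ("jumped", "VBD"), ("over", "IN")]

# The first letter of a Penn Treebank tag picks the WordNet class;
# anything unmapped falls back to noun.
wordnet_tagged = [(word, tag_dict.get(tag[0], NOUN)) for word, tag in tagged]

print(wordnet_tagged)
```

Each resulting pair can then be handed to the lemmatizer as `WordNetLemmatizer().lemmatize(word, pos)` once the wordnet data has been downloaded.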

In [None]:
from nltk.corpus import wordnet  
       
def get_wordnet_pos(tokens: list) -> list:
    tag_dict = {
                "J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV
                }
    tags = nltk.pos_tag(tokens)

    return [(word, tag_dict.get(tag[0], wordnet.NOUN)) for word, tag in tags]

pos_str = "The quick brown fox jumped over the lazy dog"

#Your Code Here

### Counting Words

#### 1. Check Vocabulary Diversity
set() builds a collection of the distinct elements of an iterable (here, all the words), so comparing its size with the total word count shows how varied the vocabulary is.
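
A minimal sketch of the calculation, using a short made-up list as a stand-in for `lemm_words`. The ratio at the end is one common way to express diversity as a single number. It is an addition of this example, not something the notebook's cell computes.

```python
# Hypothetical stand-in for the lemmatized word list built earlier.
lemm_words = ["good", "movie", "good", "plot", "bad", "movie", "good"]

distinct_words = len(set(lemm_words))  # number of distinct words
total_words = len(lemm_words)          # number of total words

# Share of distinct words; closer to 1 means a more varied vocabulary.
diversity = distinct_words / total_words

print(distinct_words, total_words, f"{diversity:.2f}")
```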

In [None]:
distinct_words = len(set(lemm_words))
total_words = len(lemm_words)

print(distinct_words)  # number of distinct words
print(total_words)     # number of total words
#Your Code Here

#### 2. Count total words and unique words
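
A plain dictionary is enough for this count. The sketch below assumes a short made-up word list; `dict.get` supplies 0 the first time a word is seen.

```python
# Hypothetical stand-in for the cleaned word list.
words = ["good", "movie", "good", "plot", "bad", "movie", "good"]

# Map each unique word to the number of times it appears.
count = {}
for word in words:
    count[word] = count.get(word, 0) + 1

print(len(words))  # total words
print(len(count))  # unique words
print(count)
```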

In [None]:
# Create a dictionary to store unique words and their counts.
count = {}
#Your Code Here


#### 3. NLTK's Frequency Distributions Functions

We can create a frequency distribution by passing our samples (a list of words) to `FreqDist`:

`fdist = FreqDist(samples)   #create a frequency distribution containing the given samples`

Function | Description
--- | ---
`fdist[sample] += 1`	| increment the count for this sample
`fdist['monstrous']`	| count of the number of times a given sample occurred
`fdist.freq('monstrous')`	| frequency of a given sample
`fdist.N()`	| total number of samples
`fdist.most_common(n)`	| the n most common samples and their frequencies
`for sample in fdist:`	| iterate over the samples
`fdist.max()`	| sample with the greatest count
`fdist.tabulate()`	| tabulate the frequency distribution
`fdist.plot()`	| graphical plot of the frequency distribution
`fdist.plot(cumulative=True)`	| cumulative plot of the frequency distribution
`fdist1 \|= fdist2`	| update fdist1 with counts from fdist2
`fdist1 < fdist2`   | test if samples in fdist1 occur less frequently than in fdist2
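
A few of the calls from the table, tried on a short made-up sample list. `FreqDist` needs no data download.

```python
from nltk import FreqDist

# Hypothetical sample list standing in for the cleaned words.
samples = ["good", "movie", "good", "plot", "bad", "movie", "good"]

fdist = FreqDist(samples)

print(fdist.N())             # total number of samples
print(fdist["good"])         # count for one sample
print(fdist.freq("good"))    # count divided by N
print(fdist.most_common(2))  # the two most frequent samples with their counts
print(fdist.max())           # sample with the greatest count
```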

In [None]:
from nltk import FreqDist #frequency distribution

#Your Code Here

In [None]:
! pip install matplotlib #creating static, animated, and interactive visualizations

In [None]:
import matplotlib.pyplot as plt  # MATLAB-style plotting interface

#Your Code Here

### Dispersion Plot

A dispersion plot shows where each chosen word occurs in the collection, measured as the number of words from the beginning of the text.
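
The data behind such a plot is simply the list of positions at which each target word appears. The sketch below computes those offsets by hand for a made-up token list. It is meant to show the idea, not how NLTK implements the plot.

```python
# Hypothetical token list standing in for the full collection of reviews.
tokens = ["good", "plot", "bad", "ending", "good", "cast", "good", "music"]
targets = ["good", "bad"]

# For each target word, record every position (word offset) where it occurs.
offsets = {
    target: [i for i, tok in enumerate(tokens) if tok == target]
    for target in targets
}

print(offsets)
```

Calling `dispersion_plot(tokens, targets)` draws one row per target word with a mark at each of these offsets.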

In [None]:
from nltk.draw.dispersion import dispersion_plot 

#Your Code Here

Reference: 

        https://www.nltk.org/book/ch01.html
        https://www.nltk.org/

