In this workshop, we will use the NLTK library to walk through the basic steps of a text analytics project. NLTK (Natural Language Toolkit) is a Python library for working with human language data.

"Text analytics is the automated process of translating large volumes of unstructured text into quantitative data to uncover insights, trends, and patterns. Combined with data visualization tools, this technique enables companies to understand the story behind the numbers and make better decisions." -monkeylearn.com

The basic text mining steps we will demonstrate are:

- tokenize text
- remove punctuation
- remove stop words
- stem and lemmatize words
- tag words
- parse words
- work with text data in pandas and dataframes
- sentiment analysis

In [None]:
from pathlib import Path 
import pandas as pd 
import os 
import glob 
import sys 

### Import files

To import a folder of files, we first use the os.chdir function to navigate to the right directory.

Then we use the glob.glob function to list all the files in it, so that we can loop over them and read each one.
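A minimal sketch of this step, assuming the reviews are plain-text .txt files. The temporary folder and the two sample files are made up so that the snippet runs on its own; in the workshop, the Sample_data folder plays that role.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for the Sample_data folder.
sample_dir = tempfile.mkdtemp()
for name, text in [("review1.txt", "Great movie."), ("review2.txt", "Too long.")]:
    with open(os.path.join(sample_dir, name), "w", encoding="utf-8") as f:
        f.write(text)

# glob.glob returns the paths matching a wildcard pattern in arbitrary order,
# so sort them to get a reproducible result.
sample_reviews = []
for path in sorted(glob.glob(os.path.join(sample_dir, "*.txt"))):
    with open(path, encoding="utf-8") as f:
        sample_reviews.append(f.read())

print(sample_reviews)
```

The sketch builds the full pattern with os.path.join instead of changing the working directory. After os.chdir(my_dir), the shorter pattern "*.txt" would match the same files.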

In [None]:
my_dir = "Sample_data"
os.chdir(my_dir) 

In [None]:
reviewList=[]
#Your Code Here


Convert the review list into a huge string.
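One possible way to do this is str.join, sketched here on a made-up two-item list that stands in for reviewList.

```python
# Hypothetical sample data standing in for reviewList.
sample_reviews = ["Great movie.", "Too long."]

# str.join concatenates the items, placing the separator between them.
all_text = " ".join(sample_reviews)
print(all_text)  # Great movie. Too long.
```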

In [None]:
#Your Code Here


### Remove punctuation and stop words
Tokenization is the process of dividing a large body of text into smaller parts called tokens.

http://www.nltk.org/nltk_data/
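The sketch below illustrates the idea on one sentence. It uses NLTK's TreebankWordTokenizer as a stand-in for word_tokenize because it needs no data download. The punctuation and stop-word sets are tiny hypothetical examples, not the lists used in the cells that follow.

```python
from nltk.tokenize import TreebankWordTokenizer

sample = "I'm happy, really happy!"

# Split the sentence into word and punctuation tokens.
tokens = TreebankWordTokenizer().tokenize(sample)

# Hypothetical, deliberately tiny filter sets for illustration only.
sample_punctuation = {",", "!", "."}
sample_stop_words = {"i", "'m", "really"}

# Drop punctuation tokens first, then stop words (compared in lowercase).
no_punct = [t for t in tokens if t not in sample_punctuation]
kept = [t for t in no_punct if t.lower() not in sample_stop_words]

print(tokens)
print(kept)
```

Note that the tokenizer splits the contraction "I'm" into two tokens, which is why the stop-word cell further down adds "'m" to its set.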

In [None]:
import nltk
# nltk.download_shell() for mac users

In [None]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

In [None]:
nltk.download('punkt')

punctuations = "$?:!.,;/\\&*)--(...-``''" 
#Your Code Here


In [None]:
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stop_words.add("'m")

#Your Code Here


### Stemming and Lemmatization

Both stemming and lemmatization aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

Lemmatization usually refers to doing this properly, using a vocabulary and morphological analysis of words. It normally aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
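A short sketch of the stemming side, using the two stemmers imported in the next cell on a few made-up words. Both stemmers are rule-based and need no data download, unlike the WordNet lemmatizer.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Print each word next to the stem produced by each stemmer.
for word in ["running", "reviews", "movies"]:
    print(word, porter.stem(word), snowball.stem(word))
```

Stems are not guaranteed to be dictionary words, which is the main practical difference from lemmas.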

In [None]:
from nltk.stem import PorterStemmer 
from nltk.stem import SnowballStemmer

#Your Code Here


In [None]:
nltk.download('wordnet') #lexical database for English (nouns, adjectives, adverbs, verbs)
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer 

#Your Code Here


### Part-of-Speech Tagging

    CC    coordinating conjunction
    CD    cardinal number
    DT    determiner
    EX    existential there (as in "there is"; think of it as "there exists")
    FW    foreign word
    IN    preposition/subordinating conjunction
    JJ    adjective 'big'
    JJR   adjective, comparative 'bigger'
    JJS   adjective, superlative 'biggest'
    LS    list marker 1)
    MD    modal could, will
    NN    noun, singular 'desk'
    NNS   noun, plural 'desks'
    NNP   proper noun, singular 'Harrison'
    NNPS  proper noun, plural 'Americans'
    PDT   predeterminer 'all the kids'
    POS   possessive ending parent's
    PRP   personal pronoun I, he, she
    PRP$  possessive pronoun my, his, her
    RB    adverb very, silently
    RBR   adverb, comparative better
    RBS   adverb, superlative best
    RP    particle give up
    TO    to go 'to' the store
    UH    interjection errrrrrrrm
    VB    verb, base form take
    VBD   verb, past tense took
    VBG   verb, gerund/present participle taking
    VBN   verb, past participle taken
    VBP   verb, non-3rd person singular present take
    VBZ   verb, 3rd person singular present takes
    WDT   wh-determiner which
    WP    wh-pronoun who, what
    WP$   possessive wh-pronoun whose
    WRB   wh-adverb where, when


#### 1. Use tags for lemmatization

pos_tag takes a list of words and returns the tag for each one as a list of tuples: [(word1, tag1), (word2, tag2), (word3, tag3)].

Use indexing to drill down: the first [0] selects the first tuple, the [1] selects its tag, and the final [0] takes the first letter of that tag.
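The indexing chain can be traced step by step. The list below is hand-written in the shape that pos_tag returns (it is not real tagger output), and the dictionary uses the single-letter strings behind the wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV constants, so the sketch runs without any downloads.

```python
# Hand-written list in the shape that nltk.pos_tag returns (not real tagger output).
tagged = [("meaningful", "JJ")]

first_pair = tagged[0]     # ("meaningful", "JJ")
tag = first_pair[1]        # "JJ"
first_letter = tag[0]      # "J"

# WordNet's part-of-speech constants are single lowercase letters:
# adjective "a", noun "n", verb "v", adverb "r".
wordnet_codes = {"J": "a", "N": "n", "V": "v", "R": "r"}

# Fall back to noun when the first letter is not in the mapping.
print(first_letter, wordnet_codes.get(first_letter, "n"))  # J a
```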

In [None]:
from nltk.corpus import wordnet  
nltk.download('averaged_perceptron_tagger')

print(nltk.pos_tag(["meaningful"])[0][1][0])
       
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

#Your Code Here



N
["'conflict", "'d", "'ll", "'ll", "'ll", "'ll", "'ll", "'ll", "'officer", "'re", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'ve", "'ve", "'ve", "'ve", '1/2', '10', '10', '10', '100th', '12', '15', '1940s', '1970', '1982', '1million', '1st', '1st', '2', '2', '2/3', '20', '20', '2003', '22', '22', '24', '25', '4', '4', '40', '5', '5m', '6', '6', '60', '7', '70mm', '77', '77', 'ability', 'ability', 'able', 'able', 'abyss', 'accident', 'acclaim', 'achieve', 'act', 'act', 'act', 'act', 'action', 'action', 'action', 'actor', 'actor', 'actor', 'actor', 'actor', 'actor', 'actual', 'actually', 'actually', 'adapts', 'admit', 'advance', 'aerial', 'aftermath', 'aftertaste', 'afterward', 'age', 'age', 'agree', 'air', 'airstation', 'ala', 'alaska', 'albert', 'allow', 'almost', 'almost', 'almost', 'along', 'alongside', 'already

#### 2. Tagging our data

nltk.pos_tag() returns a list of (word, tag) tuples. The key step is to map NLTK's POS tags to the format the WordNet lemmatizer accepts. The get_wordnet_pos() function defined above does this mapping.

Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
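Once words are tagged, collections.Counter can tally how often each tag occurs. A small sketch, assuming hand-written (word, tag) pairs rather than real tagger output:

```python
from collections import Counter

# Hand-written (word, tag) pairs for illustration; a real tagger may choose differently.
tagged = [("a", "DT"), ("review", "NN"), ("for", "IN"), ("the", "DT"), ("purpose", "NN")]

# Count the tags only, ignoring the words.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(2))
```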

In [None]:
text = word_tokenize("faking a review for tagging purpose")
#Your Code Here


In [None]:
from collections import Counter

#Your Code Here


### Counting Words

#### 1. Check Vocabulary Diversity
set() builds a collection of the distinct elements of an iterable (here, all the words).

In [None]:
print(len(set(lemm_words)))  # number of distinct words
print(len(lemm_words))       # number of total words
#Your Code Here


#### 2. Count total words and unique words

In [None]:
# Create a dictionary to store unique words and their counts.
count = {}
#Your Code Here

# Passing "key=lambda ..." to sorted() lets us sort the dictionary's items by value.
# A lambda is a small anonymous function, i.e. a function without a name.
# sorted() sorts in ascending order by default; "reverse=True" flips it to descending.
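One way the counting and sorting described in these comments could look, sketched on a made-up word list instead of the workshop data:

```python
# Hypothetical sample data standing in for the lemmatized words.
sample_words = ["good", "movie", "good", "plot", "movie", "good"]

# Build a word -> count dictionary; dict.get supplies 0 for unseen words.
count = {}
for word in sample_words:
    count[word] = count.get(word, 0) + 1

# Sort the (word, count) pairs by count, largest first.
ranked = sorted(count.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('good', 3), ('movie', 2), ('plot', 1)]
```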


#### 3. NLTK's Frequency Distributions Functions

    fdist = FreqDist(samples)      create a frequency distribution containing the given samples
    fdist[sample] += 1             increment the count for this sample
    fdist['monstrous']             count of the number of times a given sample occurred
    fdist.freq('monstrous')        frequency of a given sample
    fdist.N()                      total number of samples
    fdist.most_common(n)           the n most common samples and their frequencies
    for sample in fdist:           iterate over the samples
    fdist.max()                    sample with the greatest count
    fdist.tabulate()               tabulate the frequency distribution
    fdist.plot()                   graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 |= fdist2               keep, for each sample, the larger of its counts in fdist1 and fdist2 (use fdist1.update(fdist2) to add counts)
    fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
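A few of these calls in action, sketched on a made-up word list. FreqDist needs no data download.

```python
from nltk import FreqDist

# Hypothetical sample data standing in for the lemmatized words.
sample_words = ["good", "movie", "good", "plot", "movie", "good"]
fdist = FreqDist(sample_words)

print(fdist["good"])         # count of one sample
print(fdist.N())             # total number of samples counted
print(fdist.freq("good"))    # relative frequency: count divided by N
print(fdist.most_common(2))  # the two most common samples with their counts
print(fdist.max())           # sample with the greatest count
```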

In [None]:
from nltk import FreqDist #frequency distribution

#Your Code Here


In [None]:
! pip install matplotlib #creating static, animated, and interactive visualizations

In [None]:
import matplotlib.pyplot as plt #MATLAB-style plotting interface
#Your Code Here


### Dispersion Plot

A dispersion plot shows where selected words occur in the collection, measured by their position (word offset) from the start of the text.
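The plot draws one row per target word, with a mark at each position where that word occurs. The positions themselves can be computed without plotting, as in this sketch on a made-up word list (the names sample_words and targets are illustrative only):

```python
# Hypothetical sample data and target words.
sample_words = ["good", "movie", "bad", "plot", "good", "ending"]
targets = ["good", "plot"]

# For each target word, record the offsets at which it occurs.
offsets = {t: [i for i, w in enumerate(sample_words) if w == t] for t in targets}
print(offsets)  # {'good': [0, 4], 'plot': [3]}
```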

In [None]:
from nltk.draw.dispersion import dispersion_plot 

#Your Code Here


Reference: 

        https://www.nltk.org/book/ch01.html
        https://www.nltk.org/

