In this workshop, we will use the NLTK library to walk through some basic steps of a text analysis project. NLTK (the Natural Language Toolkit) is one of the most popular Python tools for processing human language data.

Some basic steps of text analysis we are going to demonstrate include:

       -tokenize text
       -remove punctuation
       -remove stop words
       -stem words
       -tag words
       -measure vocabulary diversity
       -compute word frequency distribution
       -analyze sentiment

In [None]:
from pathlib import Path
import pandas as pd
import os
import glob
import sys

### Import files

To import a folder of files, we first use the os.chdir function to move into the right directory.

Then we use the glob.glob function to list every matching file so that we can loop through them.
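As an alternative sketch, the same files can be collected with `pathlib` without changing the working directory. The helper name `read_reviews`, the temporary folder, and the file names below are made up for illustration; in the workshop the folder would be `Sample_data`. The sketch also assumes the files are UTF-8 encoded.

```python
import tempfile
from pathlib import Path

def read_reviews(folder):
    """Return the text of every .txt file in folder, in sorted file order."""
    reviews = []
    for path in sorted(Path(folder).glob("*.txt")):
        reviews.append(path.read_text(encoding="utf-8"))
    return reviews

# Build a throwaway folder with two hypothetical review files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "review_1.txt").write_text("Great movie, loved it.", encoding="utf-8")
    Path(tmp, "review_2.txt").write_text("Too long and too slow.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("not a review", encoding="utf-8")
    demo_reviews = read_reviews(tmp)

print(demo_reviews)
```

Because the pattern is `*.txt`, the `notes.md` file is skipped, just as it would be with `glob.glob("*.txt")`.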

In [None]:
print(os.getcwd())  # getcwd helps check that we are in the right directory

my_dir = "Sample_data"
os.chdir(my_dir)  # change the current working directory to the specified path

reviewList = []
for file_name in glob.glob("*.txt"):  # glob.glob returns a list of pathnames, so we can loop through all files
    # Read each review as plain text (assumes UTF-8 files). pd.read_csv would treat
    # the first line as a header row and split it on commas, breaking up the review.
    with open(file_name, encoding="utf-8") as f:
        reviewList.append(f.read())
print(reviewList)

Convert the review list into one long string.
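A small sketch of this step with made-up reviews. It replaces the HTML line break `<br />` with a space so that the words on either side of the tag stay separate.

```python
# Two hypothetical reviews; the first contains an HTML line break.
sample_reviews = ["Loved it<br />Would watch again", "Not my favorite"]

joined = " ".join(sample_reviews)        # one long string, reviews separated by spaces
cleaned = joined.replace("<br />", " ")  # swap the tag for a space

print(cleaned)
```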

In [None]:
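str1 = " "
data = str1.join(reviewList)  # join all reviews, separated by a space
data = data.replace("<br />", " ")  # replace HTML line breaks with a space so adjacent words stay separate
# print (data)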

# tokens = nltk.word_tokenize(str(words)) #convert words from list to string and tokenize them
# print (tokens)

### Remove punctuation and stop words
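Tokenization is the process of dividing a large body of text into smaller parts called tokens, such as words and punctuation marks.

The tokenizer and stop word list used below rely on NLTK data packages, which are listed at http://www.nltk.org/nltk_data/ and can be downloaded with nltk.download().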
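To make the idea concrete, here is a minimal sketch that uses only the standard library. The regular expression is a simplified stand-in for NLTK's `word_tokenize`, not its actual algorithm, and `simple_tokenize` and `demo_stop_words` are hypothetical names with a tiny made-up stop word list.

```python
import re
import string

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# A tiny hypothetical stop word list; NLTK's English list is much longer.
demo_stop_words = {"the", "was", "and", "a"}
punctuation_marks = set(string.punctuation)

demo_text = "The plot was thin, and the ending was a mess!"
demo_tokens = simple_tokenize(demo_text)

# Lowercase every token and drop punctuation marks.
demo_words = [t.lower() for t in demo_tokens if t not in punctuation_marks]

# Drop stop words.
demo_filtered = [w for w in demo_words if w not in demo_stop_words]

print(demo_tokens)
print(demo_filtered)
```

Using a set for the punctuation marks makes the membership test an exact match on the whole token rather than a substring search.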

In [None]:
import nltk
# nltk.download_shell()

In [None]:
# import re, pprint
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
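# A string of punctuation marks. "word not in punctuations" below is a substring test,
# so multi-character tokens such as "..." or "``" are filtered out as well.
punctuations = "$?:!.,;/\\&*)--(...-``''"

# tokens = word_tokenize(data)
tokens = word_tokenize(data.replace(".", " "))  # replace periods with spaces before tokenizing
words = [word.lower() for word in tokens if word not in punctuations]
print(sorted(words))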



In [None]:
stop_words = set(stopwords.words('english'))

## Add extra stop words after viewing the results
stop_words.add("m")
stop_words.add("'s")
# stop_words.remove("yourself")
# print (sorted(stop_words))

filtered_words = [word for word in words if not word in stop_words]
print (sorted(filtered_words)[:10])


### Stemming and Lemmatization

Both techniques aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
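The contrast can be sketched with two toy functions. Neither is NLTK's implementation: `toy_stem` is a made-up suffix chopper (not the Porter or Snowball algorithm), and `toy_lemmas` is a hypothetical four-word vocabulary standing in for WordNet.

```python
def toy_stem(word):
    """Chop off a few common suffixes; a crude heuristic, not the Porter algorithm."""
    for suffix in ("ing", "ies", "ed", "s"):
        # Only strip the suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A hypothetical mini vocabulary mapping inflected forms to their lemmas.
toy_lemmas = {"ponies": "pony", "giggling": "giggle", "was": "be", "cats": "cat"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; return it unchanged if unknown."""
    return toy_lemmas.get(word, word)

for w in ["ponies", "giggling", "cats", "was"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer produces fragments such as "pon" and "giggl" that are not dictionary words, while the lookup returns real base forms but only for words its vocabulary knows.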

In [None]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

stemmer = PorterStemmer()              # the classic Porter stemmer (not used below)
stemmer2 = SnowballStemmer("english")  # the Snowball stemmer for English

# Use "word" as the loop variable so it does not shadow the earlier "words" list
stem_words = [stemmer2.stem(word) for word in filtered_words]
print(stem_words)

In [None]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
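print (wnl.lemmatize("cats"))           # the default part of speech is noun
print (wnl.lemmatize("giggling"))       # still treated as a noun
print (wnl.lemmatize("giggling", "v"))  # "v" tells the lemmatizer to treat the word as a verb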

# lemm_words = [wnl.lemmatize(word) for word in filtered_words]
# print (sorted(lemm_words))


### Part-of-Speech Tagging

    CC    coordinating conjunction
    CD    cardinal number
    DT    determiner
    EX    existential there (like: "there is" ... think of it like "there exists")
    FW    foreign word
    IN    preposition/subordinating conjunction
    JJ    adjective 'big'
    JJR   adjective, comparative 'bigger'
    JJS   adjective, superlative 'biggest'
    LS    list marker 1)
    MD    modal could, will
    NN    noun, singular 'desk'
    NNS   noun, plural 'desks'
    NNP   proper noun, singular 'Harrison'
    NNPS  proper noun, plural 'Americans'
    PDT   predeterminer 'all the kids'
    POS   possessive ending parent's
    PRP   personal pronoun I, he, she
    PRP$  possessive pronoun my, his, hers
    RB    adverb very, silently
    RBR   adverb, comparative better
    RBS   adverb, superlative best
    RP    particle give up
    TO    to go 'to' the store
    UH    interjection errrrrrrrm
    VB    verb, base form take
    VBD   verb, past tense took
    VBG   verb, gerund/present participle taking
    VBN   verb, past participle taken
    VBP   verb, present, non-3rd person singular take
    VBZ   verb, present, 3rd person singular takes
    WDT   wh-determiner which
    WP    wh-pronoun who, what
    WP$   possessive wh-pronoun whose
    WRB   wh-adverb where, when


#### 1. Use tags for lemmatization

pos_tag returns the tag for each word as a list of tuples: [(word1, tag1), (word2, tag2), (word3, tag3)].

Use indexing to drill down: the first [0] selects the first tuple, the [1] selects its tag, and the final [0] takes the first letter of that tag.
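The indexing steps can be traced on a hard-coded list in the same shape. The tags below are hand-picked for illustration rather than actual tagger output, and the single-letter values are the part-of-speech codes that WordNet uses (adjective "a", noun "n", verb "v", adverb "r").

```python
# Hard-coded sample in the shape of a list of (word, tag) tuples.
tagged = [("meaningful", "JJ"), ("facing", "VBG"), ("kindly", "RB")]

first_pair = tagged[0]       # the first tuple: ("meaningful", "JJ")
first_tag = first_pair[1]    # its tag: "JJ"
first_letter = first_tag[0]  # the first letter of the tag: "J"

# Map the first letter of a Penn Treebank tag to a WordNet part-of-speech code.
letter_to_wordnet = {"J": "a", "N": "n", "V": "v", "R": "r"}

for word, tag in tagged:
    # Fall back to noun ("n") for any tag that is not in the mapping.
    print(word, tag, letter_to_wordnet.get(tag[0], "n"))
```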

In [None]:
from nltk.corpus import wordnet  

#WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs 
#are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. 

print (nltk.pos_tag(["meaningful"])[0][1][0])
       
       
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# print (get_wordnet_pos("facing"))
# print (get_wordnet_pos("kindly"))


lemm_words = [wnl.lemmatize(w, get_wordnet_pos(w)) for w in filtered_words]
print (sorted(lemm_words)[220:400])

#### 2. Tagging our data

nltk.pos_tag() returns a list of (word, tag) tuples. The key to lemmatizing with these tags is to map NLTK's POS tags to the format the WordNet lemmatizer accepts. The get_wordnet_pos() function defined above does this mapping.

Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

In [None]:
text = word_tokenize("faking a review for tagging purpose")
nltk.pos_tag(text)

In [None]:
from collections import Counter

tags = nltk.pos_tag(lemm_words)
# print (tags[:5])

tag_counts = Counter(tag for word,tag in tags)
print (tag_counts)


### Counting Words

#### 1. Check Vocabulary Diversity
set() creates a distinct collection of the iterable elements (all words here).
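Vocabulary diversity here is the ratio of distinct words to total words. A minimal sketch on a made-up word list:

```python
sample_words = ["movie", "good", "movie", "plot", "good", "movie"]

distinct = set(sample_words)                   # duplicates collapse: movie, good, plot
diversity = len(distinct) / len(sample_words)  # distinct words divided by total words

print(len(distinct), len(sample_words), diversity)
```

A value close to 1 means few words are repeated; a value close to 0 means the same words are used over and over.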

In [None]:
print (len(set(lemm_words)))  # number of distinct words
print (len(lemm_words))    # number of total words
print ("The vocabulary diversity of the reviews is: "+ str(len(set(lemm_words))/len(lemm_words)))

#### 2. Count occurrences of each unique word
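One way to sketch this counting step with only the standard library is `collections.Counter`, which builds the same word-to-count mapping as a manual loop. The word list is made up for illustration.

```python
from collections import Counter

sample_words = ["movie", "good", "movie", "plot", "good", "movie"]

# Counter maps each distinct word to the number of times it appears.
counts = Counter(sample_words)

# Sort the (word, count) pairs by the count (item[1]), largest first.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)

print(by_count)
```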

In [None]:
# Create a dictionary to store unique words and their counts.
count = {}
for w in lemm_words:
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
    
# count = {k: v for k, v in sorted(count.items(), key=lambda item: item[1], reverse=True)}
print (count)

# In the commented-out line above, "key=lambda item: item[1]" sorts the dictionary items by value.
# This is an example of a lambda function, which is a function without a name.
# Sorting is ascending by default; "reverse=True" flips the order to descending.


#### 3. NLTK's Frequency Distribution Functions

    fdist = FreqDist(samples)      create a frequency distribution containing the given samples
    fdist[sample] += 1             increment the count for this sample
    fdist['monstrous']             count of the number of times a given sample occurred
    fdist.freq('monstrous')        frequency of a given sample
    fdist.N()                      total number of samples
    fdist.most_common(n)           the n most common samples and their frequencies
    for sample in fdist:           iterate over the samples
    fdist.max()                    sample with the greatest count
    fdist.tabulate()               tabulate the frequency distribution
    fdist.plot()                   graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 |= fdist2               update fdist1 with counts from fdist2
    fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
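Several of these quantities can be reproduced by hand, which helps when reading the output. The sketch below uses `collections.Counter` (the class that `FreqDist` builds on) and a made-up word list; it is a standard-library analogue, not a call into NLTK.

```python
from collections import Counter
from itertools import accumulate

sample_words = ["movie", "good", "movie", "plot", "good", "movie"]
fd = Counter(sample_words)

total = sum(fd.values())            # the total that fdist.N() reports
rel_freq = fd["movie"] / total      # the value that fdist.freq("movie") reports
top_word = fd.most_common(1)[0][0]  # the sample that fdist.max() reports

# Running totals of the counts, most common first; a cumulative plot draws these values.
cumulative = list(accumulate(count for _, count in fd.most_common()))

print(total, rel_freq, top_word, cumulative)
```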

In [None]:
from nltk import FreqDist

freq_words=FreqDist(lemm_words)
freq_words.most_common(30)

In [None]:
! pip install matplotlib

In [None]:
import matplotlib.pyplot as plt  # needed for plt.title below

# plt.figure(figsize=(10, 5))
plt.title("Cumulative Frequency Distribution")
plot1 = FreqDist(lemm_words).plot(30, cumulative=True, color="black")

# plt.figure(figsize=(10, 5))
plt.title("Non-cumulative Frequency Distribution")
plot2 = FreqDist(lemm_words).plot(30, cumulative=False, color="purple")


### Dispersion Plot

A dispersion plot shows where each of the selected words occurs in the collection, measured as word offsets from the beginning of the text.
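The positions that such a plot marks can be computed directly. This sketch uses a made-up word list and only the standard library; it collects the offsets and does not draw anything.

```python
sample_words = ["movie", "scene", "good", "movie", "character", "scene", "movie"]
targets = ["movie", "scene", "character"]

# For each target word, collect the positions (word offsets) where it occurs.
offsets = {t: [i for i, w in enumerate(sample_words) if w == t] for t in targets}

print(offsets)
```

Each list of offsets corresponds to one row of marks in the plot.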

In [None]:
from nltk.draw.dispersion import dispersion_plot
dispersion_plot(lemm_words, ['movie', 'scene','character'])


Reference: 

        https://www.nltk.org/book/ch01.html
        https://www.nltk.org/

