# Section A: Preprocessing of Text

We did some text processing in Practical 2. This practical continues processing the text so that we can do work related to natural language processing (NLP) later.

## Beforehand...
**1.1 NLTK Setup**
   - NLTK is included with the Anaconda Distribution of Python, or can be downloaded directly from nltk.org. 
   - Once NLTK is installed, the text data files (corpora) should be downloaded.  See the following cell to start the download.

In [None]:
import nltk

# uncomment the line below to download NLTK resources the first time NLTK is used and RUN this cell.
# when the "NLTK Downloader" dialog appears (takes 10-20 seconds), click on the "download" button 
#nltk.download()

In [None]:
# import necessary packages

from __future__ import division
import nltk, re
from nltk import word_tokenize
from nltk import regexp_tokenize
from nltk.tag import pos_tag
from nltk.draw import tree
from nltk.corpus import stopwords

Previously in Practical 1, we learnt to tokenize text into individual words. Examine the tokens printed below. Some of them, such as punctuation marks, are not useful for NLP.

Furthermore, the computer treats the plural and singular forms of a word as different tokens, although they have the same meaning.

### What's Tokenization?
<img src="https://i.ibb.co/30VPq9D/Tokenization.jpg" style="max-width:50%;" alt="Tokenization" border="0">

In [None]:
raw = """Python is delicious. Default taggers are assigning their tag to every single
word, even words that have never been encountered before.
Once you have accomplished small things, 
you may attempt great ones safely."""

# Tokenize the text
Token1 = word_tokenize(raw) #Method 1
print(Token1)

To remove the punctuation marks, we may split the text using Regex (regular expression).
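
As a small illustration of the idea, the sketch below uses only the standard `re` module on a hypothetical sample sentence (not the `raw` text of this practical). It assumes that matching `\w+` keeps runs of word characters, while splitting on `\W+` cuts at every run of non-word characters and can leave an empty string at the end when the text finishes with punctuation.

```python
import re

# Hypothetical sample sentence, used only for this illustration
sample = "Small things first, great ones later."

# Keep every run of word characters (letters, digits, underscore)
words_findall = re.findall(r"\w+", sample)

# Split on every run of non-word characters instead
words_split = re.split(r"\W+", sample)

print(words_findall)
print(words_split)          # note the empty string after the final '.'
print(words_split[:-1])     # [:-1] drops the last element of a list
```

The trailing empty string is one reason a slice such as `[:-1]` can be handy after splitting.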

In [None]:
# Tokenize the text using regexp_tokenize
# (raw strings keep the backslash in '\w' from being read as a string escape)
Token2 = regexp_tokenize(raw, pattern=r'\w+') #Method 2

# Split the text with regex '\W+'
Token3 = re.split(r"\W+",raw) #Method 3

print('Token 2:', Token2)
print('Token 3:', Token3)

Morphological processing reduces an English word to its base form by removing differences such as affixes, plural versus singular, and uppercase versus lowercase. We can use the Porter stemmer algorithm to perform this processing.

In [None]:
def stem1(wordList):
    p = nltk.PorterStemmer()
    result = [p.stem(word) for word in wordList] # stem every token, e.g. 'words' -> 'word'
    return result


We may also use Regex to extract the root of the words that end with 's', 'es', 'ing', etc.

In [None]:
def stem2(wordList):
    output = []
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    for word in wordList:
        stem, suffix = re.findall(regexp, word)[0]
        output.append(stem)
    return output

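re.findall(regexp, word) returns all non-overlapping matches of the pattern in the string as a list. Because the pattern contains two groups, each match is a (stem, suffix) tuple, and [0] selects the first (and here the only) match.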
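
The short sketch below shows what this call returns for a few words. It reuses the same pattern as stem2; the helper name `split_suffix` is hypothetical and only exists for this illustration.

```python
import re

# Same pattern as in stem2: a lazy stem followed by an optional known suffix
regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def split_suffix(word):
    """Return the (stem, suffix) tuple of the first match."""
    return re.findall(regexp, word)[0]

for word in ["processing", "taggers", "delicious", "is", "python"]:
    print(word, "->", split_suffix(word))
```

A word without a listed suffix comes back with an empty suffix, and a short word such as "is" loses its final "s", which shows how crude a purely pattern-based stemmer can be.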

In [None]:
# Token1
text = nltk.Text(stem1(Token1[:-1])) # What does this [:-1] << face mean?
print('Stem1, Token1', text[:]) # stem1 using Porter Stemmer, Token1(token with punctuation)

print('****************************************************')

text = nltk.Text(stem2(Token1[:-1]))
print('Stem2, Token1', text[:]) # stem2 using RegEx, Token1(token with punctuation)

# Question: Can you spot any differences between two stemmers?

In [None]:
#Token2
text = nltk.Text(stem1(Token2[:-1])) # You should know by now what this smiley face [:-1] means
print('Stem1, Token2', text[:]) # stem1 using Porter Stemmer, Token2(punctuation removed using regexp_tokenize)

print('****************************************************')

text = nltk.Text(stem2(Token2[:-1])) 
print('Stem2, Token2', text[:]) # stem2 using RegEx, Token2(punctuation removed using regexp_tokenize)

In [None]:
#Token3
text = nltk.Text(stem1(Token3[:-1]))
print('Stem1, Token3', text[:])  # stem1 using Porter Stemmer, Token3(token with punctuation removed using split function)

print('****************************************************')

text = nltk.Text(stem2(Token3[:-1]))
print('Stem2, Token3', text[:])  # stem2 using RegEx, Token3(token with punctuation removed using split function)

# Section B: Analyze the Structure of the Text
<b>Categorization and Tagging</b>: Automatically tag the text using pos_tag()

In [None]:
print(pos_tag(text))

<b>Analyzing sentence structure:</b> Let's parse a sentence: <i>I shot an elephant in my pajamas</i>
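
This sentence is ambiguous: "in my pajamas" can attach to the verb phrase or to "an elephant". The following is a minimal standard-library sketch of why a parser finds two trees. It is not NLTK's chart parser. The names `RULES`, `LEXICON` and `parse_all` are hypothetical, and the grammar is a hand-copied version of the one defined in the next cell, assuming no empty or unary non-terminal rules.

```python
from functools import lru_cache

# Hand-copied grammar (assumption: no empty rules, no unary non-terminal rules)
RULES = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP")],
    "VP": [("V", "NP"), ("VP", "PP")],
}
LEXICON = {
    "I": ["NP"], "an": ["Det"], "my": ["Det"],
    "elephant": ["N"], "pajamas": ["N"], "shot": ["V"], "in": ["P"],
}

def parse_all(tokens, start="S"):
    """Return every tree (as nested tuples) that derives all the tokens from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def spans(symbol, i, j):
        # All trees rooted at `symbol` that cover tokens[i:j]
        trees = []
        if j - i == 1 and symbol in LEXICON.get(tokens[i], ()):
            trees.append((symbol, tokens[i]))
        for rhs in RULES.get(symbol, ()):
            for children in expand(rhs, i, j):
                trees.append((symbol,) + children)
        return tuple(trees)

    def expand(rhs, i, j):
        # Yield tuples of child trees, one per symbol in rhs, covering tokens[i:j]
        if len(rhs) == 1:
            for tree in spans(rhs[0], i, j):
                yield (tree,)
            return
        # The first child gets tokens[i:k]; each remaining symbol keeps at least one token
        for k in range(i + 1, j - len(rhs) + 2):
            for left in spans(rhs[0], i, k):
                for rest in expand(rhs[1:], k, j):
                    yield (left,) + rest

    return spans(start, 0, len(tokens))

trees = parse_all("I shot an elephant in my pajamas".split())
for tree in trees:
    print(tree)
print("number of parses:", len(trees))
```

Because every child of a rule must cover at least one token, the left-recursive rule VP -> VP PP always recurses on a strictly shorter span, so the search terminates.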

In [None]:
groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'
    """)
sent = word_tokenize("I shot an elephant in my pajamas")
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)
    tree.draw()

# Section C: Statistical Analysis using Simple Statistics
Find words from string using Regex

In [None]:
# to find the words that start with 'a'
print(re.findall(r"\ba[\w]*", raw)) #\b -- boundary between word and non-word

print('****************************************************')

# to find the words with at least 8 characters
print(re.findall(r"\b\w{8,}\b", raw))

We can also look up words in the list of words provided by the Words Corpus.

In [None]:
wordList = [w for w in nltk.corpus.words.words('en') if w.islower()] #remove any proper names.
print(wordList[:30])

print('****************************************************')

# Wondering what the corpus looks like?
print(nltk.corpus.words.words('en')[:50] )

We would like to determine the frequency of the meaningful words in a passage. In this case, we need to <b>exclude</b> all the stop words from the text. Stopwords usually have <b>little lexical content</b>.
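
A minimal sketch of the filtering step, using only the standard library: the tiny `STOP_WORDS` set and the token list are hypothetical stand-ins, chosen for illustration, for NLTK's English stop word list and the processed text.

```python
# Hypothetical, very small stop word list (NLTK's English list is much longer)
STOP_WORDS = {"the", "is", "a", "of", "to", "and", "you", "have"}

def content_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["Once", "you", "have", "accomplished", "small", "things"]
content = content_words(tokens)

# Fraction of tokens that carry lexical content
content_fraction = len(content) / len(tokens)
print(content)
print("Lexical content", content_fraction)
```

Comparing on the lowercase form means that capitalised stop words at the start of a sentence are filtered as well.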

In [None]:
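# the stop words in English that have little lexical content
# (kept under a new name so the imported `stopwords` corpus reader is not overwritten
# and this cell can be run more than once)
stop_words = set(stopwords.words('english'))
content = [w for w in text if w.lower() not in stop_words]
for w in content:
    print(w)
content_fraction = len(content)/len(text)
print("Lexical content", content_fraction)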

# NLP Showcase
## **1 Name Gender Classifier**
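
Before running the NLTK version, it may help to see the idea in miniature. The sketch below is a hand-written Naive Bayes classifier with add-one smoothing over two simple features. It is not NLTK's implementation, which uses a different smoothing scheme, and the toy name list and the helper names `name_features`, `train` and `classify_name` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def name_features(name):
    """Two simple features: the last and the first letter of the name."""
    name = name.lower()
    return {"suffix": name[-1:], "prefix": name[:1]}

def train(labeled_features):
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> counts of values
    seen_values = defaultdict(set)       # feature -> all values seen in training
    for features, label in labeled_features:
        label_counts[label] += 1
        for feat, value in features.items():
            value_counts[(label, feat)][value] += 1
            seen_values[feat].add(value)
    return label_counts, value_counts, seen_values

def classify_name(model, name):
    label_counts, value_counts, seen_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)              # log prior
        for feat, value in name_features(name).items():
            count = value_counts[(label, feat)][value]
            vocab = len(seen_values[feat]) + 1   # +1 slot for unseen values
            score += math.log((count + 1) / (n + vocab))  # add-one smoothing
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data, far smaller than nltk.corpus.names
toy_names = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
             ("Emma", "female"), ("John", "male"), ("Mark", "male"),
             ("Peter", "male"), ("David", "male")]
model = train([(name_features(n), g) for n, g in toy_names])

for name in ["Sophia", "Oliver"]:
    print(name, "classified as", classify_name(model, name))
```

With so little data the result depends almost entirely on the final letter, which is also why the full version adds longer suffix and prefix features.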

In [None]:
# code to build a classifier to classify names as male or female
# demonstrates the basics of feature extraction and model building

names = [(name, 'male') for name in nltk.corpus.names.words("male.txt")]
names += [(name, 'female') for name in nltk.corpus.names.words("female.txt")]

def extract_gender_features(name):
    name = name.lower()
    features = {}
    features["suffix"] = name[-1:]
    features["suffix2"] = name[-2:] if len(name) > 1 else name[0]
    features["suffix3"] = name[-3:] if len(name) > 2 else name[0]
    features["suffix4"] = name[-4:] if len(name) > 3 else name[0]
    #features["suffix5"] = name[-5:] if len(name) > 4 else name[0]
    #features["suffix6"] = name[-6:] if len(name) > 5 else name[0]
    features["prefix"] = name[:1]
    features["prefix2"] = name[:2] if len(name) > 1 else name[0]
    features["prefix3"] = name[:3] if len(name) > 2 else name[0]
    features["prefix4"] = name[:4] if len(name) > 3 else name[0]
    features["prefix5"] = name[:5] if len(name) > 4 else name[0]
    #features["wordLen"] = len(name)
    
    #for letter in "abcdefghijklmnopqrstuvwyxz":
    #    features[letter + "-count"] = name.count(letter)
   
    return features

data = [(extract_gender_features(name), gender) for (name,gender) in names]

import random
random.shuffle(data)

#print(data[:10])
#print()
#print(data[-10:])

dataCount = len(data)
trainCount = int(.8*dataCount)

trainData = data[:trainCount]
testData = data[trainCount:]
bayes = nltk.NaiveBayesClassifier.train(trainData)

def classify(name):
    label = bayes.classify(extract_gender_features(name))
    print("name=", name, "classified as=", label)

print("trainData accuracy=", nltk.classify.accuracy(bayes, trainData))
print("testData accuracy=", nltk.classify.accuracy(bayes, testData))

bayes.show_most_informative_features(25)

In [None]:
# print gender classifier errors so we can design new features to identify the cases
errors = []

for (name,label) in names:
    if bayes.classify(extract_gender_features(name)) != label:
        errors.append({"name": name, "label": label})

errors


## **2 Sentiment Analysis**
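
The cells in this part describe each review by which of the most frequent corpus words it contains. As a rough standard-library sketch of that feature extraction step, the example below uses `collections.Counter` in place of `nltk.FreqDist` and three hypothetical toy reviews in place of the movie review corpus. The names `toy_docs`, `top_keys` and `presence_features` are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical toy "reviews": (list of words, label)
toy_docs = [
    (["a", "great", "and", "moving", "film"], "pos"),
    (["great", "cast", "and", "a", "great", "script"], "pos"),
    (["a", "dull", "and", "boring", "film"], "neg"),
]

# Count every word across all documents and keep the most frequent ones
word_counts = Counter(word.lower() for words, _ in toy_docs for word in words)
top_keys = [word for word, _ in word_counts.most_common(3)]

def presence_features(doc, vocabulary):
    """Map each vocabulary word to True/False: does the document contain it?"""
    doc_set = set(doc)  # set lookup is faster than scanning the list each time
    return {word: (word in doc_set) for word in vocabulary}

print(top_keys)
print(presence_features(["great", "film"], top_keys))
```

Note that the most frequent words tend to be function words such as "a" and "and", so a large vocabulary is needed before informative words start to appear.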

In [None]:
# movie reviews / sentiment analysis - part #1
from nltk.corpus import movie_reviews as reviews
import random

# `fileid` is used instead of `id` so the built-in id() function is not shadowed
docs = [(list(reviews.words(fileid)), cat)  for cat in reviews.categories() for fileid in reviews.fileids(cat)]
random.shuffle(docs)

print([ (len(d[0]), d[0][:2], d[1]) for d in docs[:10]])

fd = nltk.FreqDist(word.lower() for word in reviews.words())
topKeys = [ key for (key,value) in fd.most_common(2000)]
print(topKeys)

In [None]:
# movie reviews sentiment analysis - part #2
import nltk

def review_features(doc):
    docSet = set(doc)
    features = {}
    
    for word in topKeys:
        features[word] = (word in docSet)
        
    return features

#review_features(reviews.words("pos/cv957_8737.txt"))

data = [(review_features(doc), label) for (doc,label) in docs]

dataCount = len(data)
trainCount = int(.8*dataCount)

trainData = data[:trainCount]
testData = data[trainCount:]
bayes2 = nltk.NaiveBayesClassifier.train(trainData)

print("train accuracy=", nltk.classify.accuracy(bayes2, trainData))
print("test accuracy=", nltk.classify.accuracy(bayes2, testData))

bayes2.show_most_informative_features(20)
