## BUSA90543 Text Analytics for Business
### Week 1 Practical exercises

This is an IPython notebook that tokenises some text using NLTK.

In [None]:
import re
import nltk
import string

If this is the first time you are using NLTK, you may need to download some of its data. For today, this is just "punkt" (from Models) and "stopwords" (from Corpora).

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/shauncai/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shauncai/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### 2. Pre-processing the HTML formatting

Let's open the file, and remove the HTML metadata:

In [None]:
# Open the file for reading
with open('Gf-wiki.html', encoding='utf8') as wiki:
    wiki_file = wiki.read()

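# We would like to remove the HTML metadata, but this doesn't do it very well:
# it deletes every character that is not a letter, space or newline, so tag and
# attribute names are left behind with their punctuation stripped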
wiki_file = re.sub(r'[^A-Za-z \n]', '', wiki_file)
wiki_file = re.sub(r'\n+', '\n', wiki_file)

# Change this to writing to a file
print(wiki_file)

DOCTYPE html
 saved from urlhttpsenwikipediaorgwikiTheGodfather 
html classclientjs venotavailable langen dirltrheadmeta httpequivContentType contenttexthtml charsetUTF
titleThe Godfather  Wikipediatitle
link relstylesheet hrefgfwikifullfilesloadphp
meta nameResourceLoaderDynamicStyles content
link relstylesheet hrefgfwikifullfilesloadphp
meta namegenerator contentMediaWiki wmf
meta namereferrer contentoriginwhencrossorigin
meta propertyogimage contenthttpsuploadwikimediaorgwikipediaencGodfatherverjpg
link relalternate hrefandroidapporgwikipediahttpenmwikipediaorgwikiTheGodfather
link relalternate typeapplicationxwiki titleEdit this page hrefhttpsenwikipediaorgwindexphptitleTheGodfatherampactionedit
link reledit titleEdit this page hrefhttpsenwikipediaorgwindexphptitleTheGodfatherampactionedit
link relappletouchicon hrefhttpsenwikipediaorgstaticappletouchwikipediapng
link relshortcut icon hrefhttpsenwikipediaorgstaticfaviconwikipediaico
link relsearch typeapplicationopensearchdescripti
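
The output above shows why the regular expression is a poor fit: tag and attribute names survive with their punctuation removed. One possible alternative, offered only as a minimal sketch, is to let Python's standard-library `html.parser` separate text from markup. The class `TextExtractor`, the helper `html_to_text` and the `sample` string are hypothetical names invented for this illustration; they are not part of the exercise or of NLTK.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Toy input standing in for the contents of Gf-wiki.html
sample = ("<html><head><title>The Godfather</title><style>p{}</style></head>"
          "<body><p>Marlon Brando &amp; Al Pacino</p></body></html>")
print(html_to_text(sample))  # The Godfather Marlon Brando & Al Pacino
```

On the real file you would pass `wiki_file` (read before any regex cleaning) to `html_to_text`. The result would still contain navigation text and other page furniture, so further cleaning may be needed.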

Here's an alternative way of doing the same thing line-by-line:

In [None]:
# Open the file for reading
with open('Gf-wiki.html', encoding='utf8') as wiki:
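    # Let's go through the input, one line at a time
    for line in wiki:
        # We would like to remove the HTML metadata, but this doesn't do it very well
        line_raw = re.sub(r'[^A-Za-z \n]', '', line)
        # Each line holds at most one '\n', so this substitution changes nothing here;
        # blank lines are not collapsed as they were in the whole-file version
        line_raw = re.sub(r'\n+', '\n', line_raw)
        # Change this to writing to a file
        # print(line_raw)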
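
The "writing to a file" step could look something like the following. This is a sketch under stated assumptions, not a model answer: the function name `clean_file` and the file names `in.html` and `out.txt` are made up for the illustration, and a small temporary file stands in for `Gf-wiki.html` so that the cell runs on its own. Because each line is processed separately, blank lines are skipped explicitly to mimic the collapsing of repeated newlines in the whole-file version.

```python
import os
import re
import tempfile


def clean_file(in_path, out_path):
    """Copy in_path to out_path, keeping only letters, spaces and newlines."""
    with open(in_path, encoding='utf8') as src, \
         open(out_path, 'w', encoding='utf8') as dst:
        for line in src:
            cleaned = re.sub(r'[^A-Za-z \n]', '', line)
            if cleaned.strip():  # skip lines that are empty after cleaning
                dst.write(cleaned)


# Demonstration on a small temporary file rather than Gf-wiki.html
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, 'in.html')
    out_path = os.path.join(tmp, 'out.txt')
    with open(in_path, 'w', encoding='utf8') as f:
        f.write('<p>Don Vito</p>\n\n<b>1972</b>\n')
    clean_file(in_path, out_path)
    with open(out_path, encoding='utf8') as f:
        result = f.read()

print(repr(result))  # 'pDon Vitop\nbb\n'
```

Note that the tag names `p` and `b` leak into the output, which is the same weakness seen in the printed result earlier.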

### 3. Tokenization with NLTK tokenizer

In [None]:
# Tokenize into word tokens
tokens = nltk.word_tokenize(wiki_file)
print(tokens[:10])  # print the first 10 tokens

['DOCTYPE', 'html', 'saved', 'from', 'urlhttpsenwikipediaorgwikiTheGodfather', 'html', 'classclientjs', 'venotavailable', 'langen', 'dirltrheadmeta']


In [None]:
brando = 0
for token in tokens:
    if token == "Brando":
        pass  # How do we count the token "Brando" here?

print(brando)

# Let's build up a word list
word_list = []
for token in tokens:
    # If the word isn't in the list, we need to add it
    if token not in word_list:
        word_list.append(token)

# How many words are there in the list?
#print(...)

# The following is a _very_ inefficient way of solving this problem
# We'll look at a better way next week
max_word = ""  # most common word
max_num = 0  # count of `max_word`
for word in word_list:
    this_num = 0  # stores the count of `word`
    for token in tokens:
        if token == word:
            this_num += 1
    if this_num > max_num:
        max_num = this_num
        max_word = word
print(f'{max_word}: {max_num} times')
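
The nested loops above compare every distinct word against every token, so the work grows with the product of the two list lengths. As a preview of one more efficient approach (not necessarily the one covered next week), the standard-library `collections.Counter` tallies all tokens in a single pass. The short `tokens` list below is a toy stand-in for the real token list.

```python
from collections import Counter

# Toy stand-in for the list produced by nltk.word_tokenize
tokens = ['the', 'Godfather', 'the', 'Brando', 'the', 'Brando']

counts = Counter(tokens)          # one pass over the tokens

print(counts['Brando'])           # occurrences of a single token: 2
print(len(counts))                # number of distinct tokens: 3

word, num = counts.most_common(1)[0]
print(f'{word}: {num} times')     # the: 3 times
```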

### 4. Tokenization with strategies given in the lecture

Now we'll tokenise the text ourselves.
Remember that we defined tokenisation as:
 - strip formatting (which we've already done)
 - strip punctuation
 - fold case
 - break at whitespace
 - stem
 - remove stopwords
 - bag of words (which we'll examine further next week)

In [None]:
# But, if we are doing this ourselves:

# You might wish to do this
#wiki_file = re.sub(r'\s+',' ',wiki_file)

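# HTML files have a special syntax for certain kinds of characters
# e.g., &nbsp; for a non-breaking space
# Note: the cleaning in section 2 already deleted '&' and ';' from wiki_file,
# so this only has an effect if you re-read the raw HTML first
wiki_file = re.sub(r'&[^;]*;', '', wiki_file)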

# This is a good place to strip punctuation
#wiki_file = ...

# This is a good place to fold case using the lower() method
#wiki_file = ...

# This is one way to break at whitespace: it splits on runs of non-word
# characters, so it will also remove punctuation
tokens = re.split(r'\W+', wiki_file)

# Remove empty strings which occur in cases like 'abc\n' -> ['abc', '']
tokens = [tok for tok in tokens if tok]

# Here is one possible stemmer, you can use it with porter.stem(token)
porter = nltk.PorterStemmer()
stemmed_tokens = []
for token in tokens:
    pass  # How to stem token here?

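# There is a list of stop words in NLTK, but you will need the data
# Pass the language explicitly: stopwords.words() with no argument returns
# the stop words of every language in the corpus
from nltk.corpus import stopwords
stop_set = set(stopwords.words('english'))  # use a set for efficiency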

# Or you could download a list from the internet
# stop_set = set()
# with open('stopwords.txt', encoding='utf8') as stop:
#   for line in stop:
#     stop_set.add(line.rstrip())  # remove trailing newline

# And then we only want tokens that aren't in the list
stopped_tokens = []
for token in stemmed_tokens:
    pass  # Change this
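
For comparison once you have attempted the cell above, here is a minimal sketch of the whole pipeline on a toy sentence. Several things in it are assumptions made for the illustration: the function name `tokenise`, the tiny `toy_stop_set`, and the `sample` string are invented; the entity pattern is slightly stricter than the one above and substitutes a space rather than an empty string so that neighbouring words do not merge; and the stemmer defaults to a do-nothing function so the sketch runs without NLTK. Passing `stem=porter.stem` would plug in the Porter stemmer created above.

```python
import re
import string


def tokenise(text, stop_set, stem=lambda tok: tok):
    """Strip entities and punctuation, fold case, split, stem, remove stop words."""
    # Replace HTML entities such as &nbsp; with a space
    text = re.sub(r'&[^;\s]*;', ' ', text)
    # Strip punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Fold case
    text = text.lower()
    # Break at whitespace (split() never yields empty strings)
    tokens = text.split()
    # Stem each token (identity by default)
    stemmed = [stem(tok) for tok in tokens]
    # Keep only tokens that are not stop words
    return [tok for tok in stemmed if tok not in stop_set]


toy_stop_set = {'the', 'a', 'of', 'in', 'is'}  # stand-in for the NLTK list
sample = "The Godfather&nbsp;is a film. Brando, in the film, plays Vito!"
print(tokenise(sample, toy_stop_set))
# ['godfather', 'film', 'brando', 'film', 'plays', 'vito']
```

The sketch stems before removing stop words, following the order of the list above. Since stop-word lists usually hold unstemmed words, you may want to check whether swapping those two steps changes your results.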

Now let's count the frequency of the word "Brando" and try to find the most frequent word in the document:

In [None]:
brando = 0
for token in stopped_tokens:
    if token == "brando":  # now lowercased
        brando += 1
print(brando)

word_list = []
for token in stopped_tokens:
    if token not in word_list:
        word_list.append(token)
print(len(word_list))

max_num = 0
max_word = ""
for word in word_list:
    this_num = 0
    for token in stopped_tokens:
        if word == token:
            this_num += 1
    if this_num > max_num:
        max_num = this_num
        max_word = word
print(f'{max_word}: {max_num} times')