## Introduction to Python for Digital Text Analysis (Part II)

This session will provide an overview of the Python Natural Language Toolkit (NLTK) library (http://www.nltk.org), which is an excellent platform for examining linguistic data. It has built-in corpora and text processing libraries for everything from tokenisation to semantic reasoning. NLTK is also favoured by teachers of computational linguistics.

We will apply some basic NLTK functions to a few YouTube comment files in our Kpop dataset, examining them both individually and in comparison with one another.

### Step I: Import necessary packages

In [None]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.probability import FreqDist
from nltk.tag import pos_tag, map_tag
from nltk import bigrams
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

### Step II: Read in comment files
Let's choose four popular songs, one from each of the Kpop groups, and import their comment text.

In [None]:
# First we need to save the paths of the comment-only files we are using.
bts_filepath = '../data/kpop_videos_comments/bts/GZjt_sA2eso.txt' # Save Me
exo_filepath = '../data/kpop_videos_comments/exo/yWfsla_Uh80.txt' # Call Me Baby
twice_filepath = '../data/kpop_videos_comments/twice/EpMwiqW8k8o.txt' # 'Signal' dance video
blackpink_filepath = '../data/kpop_videos_comments/blackpink/bwmSjveL3Lc.txt' # Boombayah

# Next, let's read in the text from the files as strings, using UTF-8 encoding so that emoji are decoded correctly.
with open(bts_filepath, encoding="utf-8") as text:
    bts = text.read()
with open(exo_filepath, encoding="utf-8") as text:
    exo = text.read()
with open(twice_filepath, encoding="utf-8") as text:
    twice = text.read()
with open(blackpink_filepath, encoding="utf-8") as text:
    blackpink = text.read()

print(bts[:300]) # Print first 300 characters.

In [None]:
# Remove the 'CommentTextDisplay' header (metadata) from each file by dropping its first 19 characters.
bts = bts[19:]
exo = exo[19:]
twice = twice[19:]
blackpink = blackpink[19:]
print(bts[:300])
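
The four `open` calls and the fixed 19-character slice can also be folded into a small helper. The sketch below is one possible approach, not part of the original notebook: `read_comments` is a hypothetical function name, and it assumes the metadata header is the literal line `CommentTextDisplay` followed by a line break. Checking for the header explicitly avoids cutting off real text if a file happens not to contain it.

```python
import tempfile
from pathlib import Path

HEADER = "CommentTextDisplay"  # assumed header line, taken from the comment above


def read_comments(path):
    """Read a comment file as UTF-8 and drop the header line if it is present."""
    text = Path(path).read_text(encoding="utf-8")
    if text.startswith(HEADER):
        # Remove the header and the line break that follows it.
        text = text[len(HEADER):].lstrip("\n")
    return text


# Demonstration on a throwaway file (the content is made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text("CommentTextDisplay\nSave Me is amazing 💜\n", encoding="utf-8")
    demo = read_comments(demo_path)

print(demo)
```

With the paths defined earlier, this would be called as `bts = read_comments(bts_filepath)`, and likewise for the other three files.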

### Step III: Tokenise the comments

Now that we've read in the comment files as strings, let's tokenise them so that we can analyse their words in various ways. This will transform each string into a list of 'words' (tokens).

NLTK's default word tokeniser is designed for standard prose and tends to break up social media conventions such as hashtags and emoticons (for example, it splits `#bts` into `#` and `bts`). We will use its tweet tokeniser, which recognises such items and keeps them intact: http://www.nltk.org/api/nltk.tokenize.html

In [None]:
# Let's ignore case, so that 'BTS' and 'bts', for example, are treated as the same type.
bts_tokenized = TweetTokenizer().tokenize(bts.lower())
exo_tokenized = TweetTokenizer().tokenize(exo.lower())
twice_tokenized = TweetTokenizer().tokenize(twice.lower())
blackpink_tokenized = TweetTokenizer().tokenize(blackpink.lower())
print(bts_tokenized[:50]) # The first 50 'words'.
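
To see what the tweet tokeniser does with typical comment text, it helps to run it on a short string. The example sentence below is invented for illustration; `preserve_case` and `reduce_len` are optional `TweetTokenizer` parameters that the cell above does not use.

```python
from nltk.tokenize import TweetTokenizer

sample = "OMG I loooooove #BTS :) so much"

# Default settings plus manual lower-casing, as in the cell above.
basic = TweetTokenizer().tokenize(sample.lower())
print(basic)

# preserve_case=False lower-cases tokens inside the tokeniser;
# reduce_len=True shortens runs of three or more repeated characters to three.
tuned = TweetTokenizer(preserve_case=False, reduce_len=True).tokenize(sample)
print(tuned)
```

Shortening elongated words can reduce the number of one-off spelling variants, at the cost of discarding a feature of the comments that may itself be of interest.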

### Step IV: Word-level calculations

After tokenising our comment files, we can count their 'words' in several ways.

In [None]:
# Length of a text (token count). Note that punctuation symbols and emoji are also counted as tokens.
print(len(bts_tokenized))
print(len(exo_tokenized))
print(len(twice_tokenized))
print(len(blackpink_tokenized))

In [None]:
# Number of unique vocabulary items (type count). Sets in Python contain unique objects.
print(len(set(bts_tokenized)))
print(len(set(exo_tokenized)))
print(len(set(twice_tokenized)))
print(len(set(blackpink_tokenized)))

From the type and token counts, we can calculate *lexical diversity* (also called lexical richness): the number of unique types divided by the total number of tokens. Which Kpop video has the most lexically diverse comments?

In [None]:
def lexical_diversity(comments):
    return len(set(comments))/len(comments)

print(lexical_diversity(bts_tokenized))
print(lexical_diversity(exo_tokenized))
print(lexical_diversity(twice_tokenized))
print(lexical_diversity(blackpink_tokenized))
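
One caveat when comparing these ratios: the type-token ratio tends to fall as a text gets longer, because common words keep repeating while new types appear more slowly. A rough way to make the comparison fairer is to compute the ratio on equally sized samples. The sketch below illustrates the idea with invented token lists; `sample_diversity` and the sample size are hypothetical choices, not part of the original notebook.

```python
def lexical_diversity(tokens):
    """Type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def sample_diversity(tokens, size):
    """Type-token ratio of the first `size` tokens only."""
    return lexical_diversity(tokens[:size])


# Invented token lists of different lengths.
short_text = ["i", "love", "this", "song"]
long_text = ["i", "love", "this", "song", "i", "love", "this", "band", "so", "much"]

print(lexical_diversity(short_text))   # all four tokens are distinct
print(lexical_diversity(long_text))    # repeats pull the ratio down
print(sample_diversity(long_text, 4))  # same sample size as short_text
```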

Now let's examine how often each type occurs. We start by building a frequency distribution for each video and then look at the most common types in the BTS comments.

In [None]:
# Frequency distribution of types.
fdistbts = FreqDist(bts_tokenized)
fdistexo = FreqDist(exo_tokenized)
fdisttwice = FreqDist(twice_tokenized)
fdistblackpink = FreqDist(blackpink_tokenized)

# 50 most frequent types in the BTS video comments.
fdistbts.most_common(50)

Now let's examine specific words in the BTS dataset. We can compare several band members to see who is mentioned more often.

In [None]:
# Frequency of a specific type.
print(fdistbts["jimin"])
print(fdistbts["jungkook"])
print(fdistbts["suga"])

# Just for fun, let's see how often 'BTS' appears in the comments of rival group EXO's video, and vice versa.
print(fdistexo["bts"])
print(fdistbts["exo"])
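
Raw counts are hard to compare across videos with different numbers of comments. A common remedy is to report relative frequency, for example occurrences per 1,000 tokens. The sketch below uses `collections.Counter` (which NLTK's `FreqDist` subclasses) on invented tokens; `per_thousand` is a hypothetical helper name.

```python
from collections import Counter


def per_thousand(counts, word):
    """Occurrences of `word` per 1,000 tokens in a Counter of token counts."""
    total = sum(counts.values())
    return 1000 * counts[word] / total


# Invented token lists standing in for two comment files.
video_a = ["jimin", "is", "the", "best", "jimin", "forever"]
video_b = ["exo", "forever", "exo", "exo", "love", "you", "jimin", "too", "!", "!"]

counts_a = Counter(video_a)
counts_b = Counter(video_b)

print(per_thousand(counts_a, "jimin"))
print(per_thousand(counts_b, "jimin"))
```

With the objects defined above, `fdistbts.freq("jimin")` returns the same proportion without the scaling factor.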

Now let's look at some more peculiar types: those that appear only once (hapax legomena), those that are extremely long, and those that are both long and frequently occurring. Such words often add a different perspective on a corpus of text (they're a bit like linguistic outliers!).

In [None]:
# How many hapax legomena are there in the BTS comments?
len(fdistbts.hapaxes())

In [None]:
# Let's just look at the first 100.
print(fdistbts.hapaxes()[:100])
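
Finding hapax legomena amounts to selecting the types whose count is exactly one. A minimal sketch with invented tokens, using `collections.Counter` rather than NLTK so that the logic is visible:

```python
from collections import Counter

tokens = ["save", "me", "save", "me", "i", "need", "your", "love", "love"]
counts = Counter(tokens)

# Types that occur exactly once.
hapaxes = [word for word, count in counts.items() if count == 1]
print(hapaxes)

# Share of the vocabulary (types) made up of hapaxes.
print(len(hapaxes) / len(counts))
```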

In [None]:
# All of the long words (more than 50 characters) in the BTS comments.
bts_vocab = set(bts_tokenized)
bts_long_words = [word for word in bts_vocab if len(word)>50]
sorted(bts_long_words)

In [None]:
# There are too many URLs in the 'long words list'. Let's remove them and try again.
# str.startswith accepts a tuple of prefixes, so one test covers both cases.
bts_vocab_nourls = [word for word in bts_vocab if not word.startswith(('http', 'www'))]

bts_long_words = [word for word in bts_vocab_nourls if len(word)>50]
sorted(bts_long_words)

In [None]:
# Now let's examine the words of more than 12 characters that occur more than 5 times.
bts_long_frequent_words = [word for word in bts_vocab_nourls if len(word)>12 and fdistbts[word]>5]
sorted(bts_long_frequent_words)

### Step V: Part-of-speech (POS) tagging

Having examined counts of specific types, let's take a step back and group the types into their POS categories. NLTK's default POS tagger uses tags from the Penn Treebank project: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
# To save computing time, let's just look at the tags of the first 10,000 tokens in the BTS comments.
bts_tagged = nltk.pos_tag(bts_tokenized[:10000])

In [None]:
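# If you get a LookupError when running the above code block, run this cell and then rerun that block. Otherwise, ignore it.
# (Recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead; use the resource name given in the error message.)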
nltk.download('averaged_perceptron_tagger')

In [None]:
# Let's rank the frequency of the different POS tags.
bts_tag_fd = nltk.FreqDist(tag for (word, tag) in bts_tagged)
print(bts_tag_fd.most_common())

In [None]:
# Now let's count each (type, POS tag) pair and show the ten most frequent pairs.
bts_type_tag_fd = nltk.FreqDist(bts_tagged)
bts_type_tag_fd.most_common(10)

We can sort the types within a certain POS category by frequency.

In [None]:
# What are the most popular verbs (base form)? How accurate is the tagger?
[typetag[0] for (typetag, _) in bts_type_tag_fd.most_common() if typetag[1] == "VB"]
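
The `VB` tag covers only base-form verbs; the Penn Treebank tagset splits verbs, nouns, and adjectives into several related tags (for example `JJ`, `JJR`, and `JJS` for adjectives). Matching on the tag prefix collects them all. The sketch below uses a small hand-tagged list as a stand-in for `bts_tagged`; the words and tags are invented for illustration, and `most_common_by_prefix` is a hypothetical helper.

```python
from collections import Counter


def most_common_by_prefix(tagged_tokens, prefix, n=10):
    """Most frequent words whose POS tag starts with `prefix`."""
    counts = Counter(word for word, tag in tagged_tokens if tag.startswith(prefix))
    return counts.most_common(n)


# Hand-written (word, tag) pairs in the same shape as nltk.pos_tag output.
toy_tagged = [
    ("best", "JJS"), ("song", "NN"), ("ever", "RB"),
    ("beautiful", "JJ"), ("voice", "NN"), ("beautiful", "JJ"),
    ("jimin", "NNP"), ("better", "JJR"), ("jimin", "NNP"),
]

print(most_common_by_prefix(toy_tagged, "JJ"))   # adjectives of any degree
print(most_common_by_prefix(toy_tagged, "NNP"))  # proper nouns
```

Note that the prefix `NN` would also match `NNP` and `NNPS`, so common nouns need an exact match on `NN` and `NNS` instead.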

### Step VI: N-grams and collocations

N-grams are sequences of n words that co-occur within a given window: 2-grams (bigrams) are pairs of words, 3-grams (trigrams) are triples, and so on. Typically the words must be directly adjacent to each other.
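
To make the definition concrete, adjacent n-grams can be built by sliding a window of n tokens across the text. This is a plain-Python sketch on an invented sentence; `ngrams_of` is a hypothetical helper, and NLTK provides `nltk.bigrams` and `nltk.ngrams` for the same purpose.

```python
def ngrams_of(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["i", "love", "this", "song"]

print(ngrams_of(tokens, 2))  # bigrams
print(ngrams_of(tokens, 3))  # trigrams
```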

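N-gram collocations are n-grams that occur more often than we would expect based on the frequency of the individual words. We can rank candidate collocations in NLTK using Pointwise Mutual Information (PMI), which compares how often the words actually occur together with how often they would co-occur if they were independent of each other.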

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_words(bts_tokenized)

# Top 20 bigrams, ranked by PMI.
print(bigram_finder.nbest(bigram_measures.pmi, 20))

In [None]:
# Let's filter the results to only see the top 20 bigrams that appear at least five times.
bigram_finder.apply_freq_filter(5)
print(bigram_finder.nbest(bigram_measures.pmi, 20))
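
To see why the frequency filter changes the ranking, it helps to compute PMI by hand. For a bigram (x, y), PMI is the base-2 logarithm of P(x, y) / (P(x) P(y)), so a pair of words that each occur only once, and occur together, scores very highly. The sketch below uses invented tokens and simple relative-frequency estimates; it illustrates the idea and is not guaranteed to reproduce NLTK's scores exactly.

```python
import math
from collections import Counter

tokens = ["i", "love", "bts", "i", "love", "exo",
          "i", "love", "jimin", "save", "me"]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)
n_bigrams = sum(bigram_counts.values())


def pmi(pair):
    """Pointwise mutual information (base 2) of an adjacent word pair."""
    x, y = pair
    p_xy = bigram_counts[pair] / n_bigrams
    p_x = unigram_counts[x] / n_tokens
    p_y = unigram_counts[y] / n_tokens
    return math.log2(p_xy / (p_x * p_y))


print(pmi(("i", "love")))   # pair seen three times
print(pmi(("save", "me")))  # pair seen once, made of words seen once
```

In this toy data the one-off pair outscores the pair that appears three times, which is the behaviour the frequency filter is meant to suppress.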

In [None]:
# Repeat for trigrams.
trigram_measures = nltk.collocations.TrigramAssocMeasures()
trigram_finder = TrigramCollocationFinder.from_words(bts_tokenized)
trigram_finder.apply_freq_filter(5)
print(trigram_finder.nbest(trigram_measures.pmi, 20))

### Open-Ended Exercises and Questions
1. Rerun the above analyses of the BTS video for the EXO video (as they are rival bands!).
2. What are the most common adjectives in the BTS and EXO comments? The most common proper nouns?
3. What are the most common short words (length < 5)?
4. How often do the words 'love' and 'hate' occur in the BTS and EXO video comments?
5. What are the biggest issues with the above analyses, as applied to social web text vs. a traditional text corpus?