## Introduction to Python for Digital Text Analysis (Part II)

This session will provide an overview of the Python Natural Language Toolkit (NLTK) library (http://www.nltk.org), which is an excellent platform for examining linguistic data. It has built-in corpora and text processing libraries for everything from tokenisation to semantic reasoning. NLTK is also favoured by teachers of computational linguistics.

We will apply some basic NLTK functions to a few YouTube comment files in our Kpop dataset, examining them both individually and comparatively.

### Step I: Import necessary packages

In [138]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.probability import FreqDist
from nltk.tag import pos_tag, map_tag
from nltk.collocations import *

# To display our visualisations within the notebook and make them look prettier.
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')  # Named 'seaborn-v0_8' in Matplotlib 3.6 and later.

### Step II: Read in comment files
Let's choose four popular songs, one from each of the Kpop groups, and import their comment text.

In [2]:
# First we need to save the paths of the comment-only files we are using.
bts_filepath = '../data/kpop_videos_comments/bts/GZjt_sA2eso.txt' # Save Me
exo_filepath = '../data/kpop_videos_comments/exo/yWfsla_Uh80.txt' # Call Me Baby
twice_filepath = '../data/kpop_videos_comments/twice/EpMwiqW8k8o.txt' # 'Signal' dance video
blackpink_filepath = '../data/kpop_videos_comments/blackpink/bwmSjveL3Lc.txt' # Boombayah

# Next, let's read in the text from the files as strings, using UTF-8 encoding so that emoji are read correctly.
with open(bts_filepath, encoding="utf-8") as text:
    bts = text.read()
with open(exo_filepath, encoding="utf-8") as text:
    exo = text.read()
with open(twice_filepath, encoding="utf-8") as text:
    twice = text.read()  
with open(blackpink_filepath, encoding="utf-8") as text:
    blackpink = text.read()

print(bts[:300]) # Print first 300 characters.

CommentTextDisplay
V looks so awesome ! hope they come some day to germany xD
Please come to Greece Thessaloniki!!!
Who watch this in 2017?
Can se just appreciate how hot taehyung looks in the mv
I love you BTS ❤🙆
I love this song so much I can't explain how much I do
Vote for bts http://www.billboa


In [3]:
# Remove CommentTextDisplay from all files by stripping off the first 19 characters (part of metadata).
bts = bts[19:]
exo = exo[19:]
twice = twice[19:]
blackpink = blackpink[19:]
print(bts[:300])

V looks so awesome ! hope they come some day to germany xD
Please come to Greece Thessaloniki!!!
Who watch this in 2017?
Can se just appreciate how hot taehyung looks in the mv
I love you BTS ❤🙆
I love this song so much I can't explain how much I do
Vote for bts http://www.billboard.com/fan-army-bra
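
Slicing off a fixed 19 characters works here because every file starts with the same header line followed by a line break. A slightly more defensive sketch, assuming only that the header sits on the first line, drops that line by name instead. The helper `strip_header` and the sample string are hypothetical and not part of the original notebook.

```python
def strip_header(text, header="CommentTextDisplay"):
    """Drop the first line if it is the expected header (hypothetical helper)."""
    first_line, _, rest = text.partition("\n")
    return rest if first_line.strip() == header else text

# An invented two-comment file, used only for illustration.
sample = "CommentTextDisplay\nI love you BTS\nWho watch this in 2017?"
print(strip_header(sample))
```

If a file happens not to start with the header, the text is returned unchanged rather than losing its first 19 characters.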


### Step III: Tokenise the comments

Now that we've read in the comment files as strings, let's tokenise them so that we can analyse their words in various ways. This will transform each string into a list of 'words' (tokens).

NLTK's default word tokenizer is not designed for social media text: it splits hashtags and emoticons into separate pieces. We will instead use its tweet tokenizer, which keeps hashtags and emoticons intact and treats each emoji as its own token: http://www.nltk.org/api/nltk.tokenize.html
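
To get a feel for the tweet tokenizer before running it on a whole file, here is a small sketch on a single invented comment. The sample string is an assumption made up for illustration; only `TweetTokenizer` and its `tokenize` method come from NLTK.

```python
from nltk.tokenize import TweetTokenizer

# An invented comment, used only to illustrate the tokenizer's behaviour.
sample_comment = "I love #BTS so much :) ❤"

tokenizer = TweetTokenizer()
sample_tokens = tokenizer.tokenize(sample_comment.lower())
print(sample_tokens)
```

The hashtag, the emoticon and the emoji should each come out as a single token.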

In [14]:
# Let's ignore case, so that 'BTS' and 'bts', for example, are treated as the same type.
bts_tokenized = TweetTokenizer().tokenize(bts.lower())
exo_tokenized = TweetTokenizer().tokenize(exo.lower())
twice_tokenized = TweetTokenizer().tokenize(twice.lower())
blackpink_tokenized = TweetTokenizer().tokenize(blackpink.lower())
print(bts_tokenized[:50]) # The first 50 'words'.

['v', 'looks', 'so', 'awesome', '!', 'hope', 'they', 'come', 'some', 'day', 'to', 'germany', 'xd', 'please', 'come', 'to', 'greece', 'thessaloniki', '!', '!', '!', 'who', 'watch', 'this', 'in', '2017', '?', 'can', 'se', 'just', 'appreciate', 'how', 'hot', 'taehyung', 'looks', 'in', 'the', 'mv', 'i', 'love', 'you', 'bts', '❤', '🙆', 'i', 'love', 'this', 'song', 'so', 'much']


### Step IV: Word-level calculations

After tokenising our comment files, we can count their 'words' in a variety of ways.

In [92]:
# Length of a text (token count). Note that punctuation symbols and emoji are also counted as tokens.
print(len(bts_tokenized))
print(len(exo_tokenized))
print(len(twice_tokenized))
print(len(blackpink_tokenized))

1065342
1498250
92904
1295104


In [90]:
# Number of unique vocabulary items (type count). Sets in Python contain unique objects.
print(len(set(bts_tokenized)))
print(len(set(exo_tokenized)))
print(len(set(twice_tokenized)))
print(len(set(blackpink_tokenized)))

32970
42870
6447
36842


From the type count we can calculate *lexical richness* (also called lexical diversity, or the type-token ratio): the number of unique words divided by the total number of words. Which Kpop video has the most lexically diverse comments?

In [94]:
def lexical_diversity(comments):
    return len(set(comments))/len(comments)

print(lexical_diversity(bts_tokenized))
print(lexical_diversity(exo_tokenized))
print(lexical_diversity(twice_tokenized))
print(lexical_diversity(blackpink_tokenized))

0.03094780830944429
0.02861338227932588
0.06939421338155516
0.028447136291757266
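
A caveat: this ratio tends to fall as a text gets longer, so the Twice file (about 93,000 tokens) is not directly comparable with the other three (over a million tokens each). One common workaround, sketched here on the assumption that averaging over fixed-size chunks is acceptable for our purposes, is to compute the ratio for equally sized segments and take the mean. The function name `mean_segmental_ttr` and the toy token list are illustrative and not part of NLTK.

```python
def mean_segmental_ttr(tokens, segment_size=1000):
    """Average type-token ratio over consecutive, equally sized segments.

    Leftover tokens that do not fill a whole segment are ignored.
    """
    ratios = []
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        ratios.append(len(set(segment)) / segment_size)
    if not ratios:
        raise ValueError("Text is shorter than one segment.")
    return sum(ratios) / len(ratios)

# Toy example: two segments of four tokens each.
toy_tokens = ["i", "love", "bts", "!", "i", "love", "love", "love"]
print(mean_segmental_ttr(toy_tokens, segment_size=4))
```

With the real data this could be called as `mean_segmental_ttr(bts_tokenized)`, and likewise for the other three token lists.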


Now let's build a frequency distribution of the types in each dataset, and look at the most common types in the BTS comments.

In [53]:
# Frequency distribution of types.
fdistbts = FreqDist(bts_tokenized)
fdistexo = FreqDist(exo_tokenized)
fdisttwice = FreqDist(twice_tokenized)
fdistblackpink = FreqDist(blackpink_tokenized)

# 50 most frequent types in the BTS video comments.
fdistbts.most_common(50)

[('i', 31194),
 ('the', 26225),
 (',', 26032),
 ('.', 26017),
 ('!', 24543),
 ('and', 19378),
 ('to', 18892),
 ('this', 16273),
 ('is', 13632),
 ('me', 12380),
 ('you', 11159),
 ('bts', 10592),
 ('love', 10169),
 ('a', 9974),
 ('it', 9919),
 ('of', 9790),
 ('?', 9167),
 ('so', 9160),
 ('in', 8760),
 ('for', 8208),
 ('my', 7835),
 ("'", 7166),
 ('that', 7165),
 ('song', 6471),
 ('save', 6454),
 ('they', 6409),
 ('are', 6301),
 ('like', 5966),
 ('but', 5883),
 ('on', 5648),
 ('...', 5382),
 ('just', 5227),
 (')', 5175),
 ('we', 4928),
 ('(', 4650),
 ('them', 4592),
 ('with', 4541),
 ('all', 4468),
 ('their', 4370),
 ('not', 4065),
 ('can', 3995),
 ('*', 3936),
 ("i'm", 3913),
 ('be', 3773),
 ('-', 3755),
 (':', 3749),
 ('have', 3698),
 ('mv', 3672),
 ('one', 3544),
 ('was', 3518)]
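
The top of this list is dominated by punctuation and function words. A rough sketch of one way to surface content words is to keep only tokens that contain a letter or digit and that are not in a stopword list. The tiny `STOPWORDS` set below is a hand-made assumption; NLTK's `stopwords` corpus would be a fuller alternative, but it needs a separate download.

```python
from collections import Counter

# A deliberately tiny, hand-made stopword list (an assumption for this sketch).
STOPWORDS = {"i", "the", "and", "to", "this", "is", "me", "you", "a", "it", "of", "so", "in"}

def content_word_counts(tokens):
    """Count tokens that contain a letter or digit and are not stopwords."""
    kept = [
        token for token in tokens
        if token not in STOPWORDS and any(char.isalnum() for char in token)
    ]
    return Counter(kept)

toy_tokens = ["i", "love", "this", "song", "!", "!", "bts", "love", "."]
print(content_word_counts(toy_tokens).most_common(2))
```

Note that this filter also drops emoji, which may or may not be what we want for YouTube comments.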

Now let's examine specific words in the BTS dataset. We can compare several band members to see who is mentioned more often.

In [129]:
# Frequency of a specific type.
print(fdistbts["jimin"])
print(fdistbts["jungkook"])
print(fdistbts["suga"])

# Just for fun, let's see how often 'BTS' appears in the comments of rival group EXO's video, and vice versa.
print(fdistexo["bts"])
print(fdistbts["exo"])

1964
1674
1063
563
352
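
Raw counts are hard to compare across files of different sizes: the EXO file has roughly 1.5 million tokens, the BTS file roughly 1.07 million. A simple sketch of normalising to occurrences per 10,000 tokens follows, using the counts and token totals printed earlier. The helper name `per_10k` is hypothetical.

```python
def per_10k(count, total_tokens):
    """Convert a raw count into occurrences per 10,000 tokens."""
    return 10_000 * count / total_tokens

# Counts and totals copied from the outputs above.
bts_in_exo = per_10k(563, 1498250)   # 'bts' in the EXO comments
exo_in_bts = per_10k(352, 1065342)   # 'exo' in the BTS comments

print(round(bts_in_exo, 2))
print(round(exo_in_bts, 2))
```

`FreqDist` offers a related shortcut: `fdistexo.freq("bts")` returns the count divided by the total number of tokens.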


Now let's look at some more peculiar types: those that appear only once (hapax legomena), those that are extremely long, and those that are both long and frequently occurring. Such words often add a different perspective on a corpus of text (they're a bit like linguistic outliers!).

In [64]:
# How many hapax legomena are there in the BTS comments?
len(fdistbts.hapaxes())

18769
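
That is more than half of the 32,970 types in the BTS comments. For readers who want to see what `hapaxes()` is doing, here is a minimal plain-Python sketch of the same idea on a toy token list. The function name `hapax_legomena` is made up for this example.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the types that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [token for token, count in counts.items() if count == 1]

toy_tokens = ["save", "me", "save", "me", "please", "army"]
toy_hapaxes = hapax_legomena(toy_tokens)
print(toy_hapaxes)
print(len(toy_hapaxes) / len(set(toy_tokens)))  # Share of types that are hapaxes.
```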

In [68]:
# Let's just look at the first 100.
print(fdistbts.hapaxes()[:100])

['12,898', '49.7', '방탄은', 'пинкан', 'keeep', 'ahahahahaha', 'haerteu', 'swong', 'adjusted', '2,324', 'œë', '🔋', 'flawlessly', '2679', '2263', '94.727', 'poit', 'specialized', '319', 'caetan', 'carlos', '@flower', '0:02', '99.450', 'sugus', 'pillowy', 'todayyy', 'peps', 'emeterio', 'haxwfp', 'snbchhjd', 'ajujuju', '(911) 420 6660', 'recommends', '79k', '@sihemkpop', 'spose', 'melany', '#youneverwatchalone', 'tsmclip', 'nrj', 'iin', 'mishra', 'faaaaaaaaaaaaaav', 'excercise', 'famil', '0:19', 'pv', 'peloo', 'jewelry', 'mama.mwave.me', 'johannesburg', 'đi', 'anuar', '2:52-', 'française', 'roblox', 'preperation', 'addication', "the've", 'eyebags', 'ssaaammmeeee', 'prefiro', 'http://www.thepetitionsite.com/es-es/589/324/503/demand-%22blood-sweat-tears%22-by-bts-be-played-on-the-radio/', 'magos', '.\n.', 'hehhh', 'https://www.change.org/p/big-hit-entertainment-bts-stylists-bring-back-jhope-s-forehead?tk=cduwspasg8atpapenfkofwmnudxp3wknsseyxtzshfo&utm_medium=email&utm_source=signature_receipt&

In [116]:
# All of the long words (more than 50 characters) in the BTS comments.
bts_vocab = set(bts_tokenized)
bts_long_words = [word for word in bts_vocab if len(word)>50]
sorted(bts_long_words)

['4ghijtebblkjregvhkugrhbugrehvi4rouugfoiruhg8754ughiuwtrhgiutrwhgurtwg',
 '___love___bangtan___jungkook___save___me___army___forever___',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhh',
 'aaaaaaaaaaaaaammmmmmmmmmmmmmaaaaaazzzzzzzzzzzzzziiiiiiiiiiiiiiiiiiiiiiinnnnnnnnnnnnnnngggggggggggggggggg',
 'armyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy',
 'bbbbbbbbbbbbbbaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 'bwvjhlfrbvjrthbvuhrfvihtrhgkjjuyhgrviwhbutbgkgojwrkhviutkuuslhgiuet',
 'eexxxpppppppeeeeeeeeennsssssssssiiiiivvvvvvveeeeeeee',
 'gooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo',
 'hahahahahhahahahahahhahahahahahahhahahahhahahahhahahahahahhahahahahahahahhahahahahahaahhahahahahahhahahahahhahahahahahhahahahahhahahahahhahahahahhahhahahahahhahahahahhahahahahahahahahahahahhahahahahhahahahahahhahaaahhahahhahahahahahhahhahahahah

In [121]:
# There are too many URLs in the 'long words list'. Let's remove them and try again.

bts_vocab_nourls = []

# startswith() accepts a tuple of prefixes, so one test covers both cases.
for word in bts_vocab:
    if not word.startswith(('http', 'www')):
        bts_vocab_nourls.append(word)

bts_long_words = [word for word in bts_vocab_nourls if len(word)>50]
sorted(bts_long_words)

['4ghijtebblkjregvhkugrhbugrehvi4rouugfoiruhg8754ughiuwtrhgiutrwhgurtwg',
 '___love___bangtan___jungkook___save___me___army___forever___',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhh',
 'aaaaaaaaaaaaaammmmmmmmmmmmmmaaaaaazzzzzzzzzzzzzziiiiiiiiiiiiiiiiiiiiiiinnnnnnnnnnnnnnngggggggggggggggggg',
 'armyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy',
 'bbbbbbbbbbbbbbaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 'bwvjhlfrbvjrthbvuhrfvihtrhgkjjuyhgrviwhbutbgkgojwrkhviutkuuslhgiuet',
 'eexxxpppppppeeeeeeeeennsssssssssiiiiivvvvvvveeeeeeee',
 'gooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo',
 'hahahahahhahahahahahhahahahahahahhahahahhahahahhahahahahahhahahahahahahahhahahahahahaahhahahahahahhahahahahhahahahahahhahahahahhahahahahhahahahahhahhahahahahhahahahahhahahahahahahahahahahahhahahahahhahahahahahhahaaahhahahhahahahahahhahhahahahah
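
Many of these long 'words' are elongations of ordinary words ('armyyyy...', 'gooooo...'). If we wanted to count them together with their shorter variants, one option is to shorten long runs of a repeated character. The sketch below does this with a regular expression; the helper `collapse_elongation` is hypothetical. `TweetTokenizer` has a `reduce_len=True` option that does something similar at tokenisation time.

```python
import re

def collapse_elongation(word, max_run=3):
    """Shorten any run of one repeated character to at most max_run characters."""
    # (.) captures a character; \1{max_run,} matches max_run or more repeats of it.
    pattern = re.compile(r"(.)\1{" + str(max_run) + r",}")
    return pattern.sub(lambda match: match.group(1) * max_run, word)

print(collapse_elongation("armyyyyyyyyyy"))
print(collapse_elongation("gooooooood"))
print(collapse_elongation("love"))
```

Keeping three characters rather than one preserves the fact that the word was elongated, while still merging 'armyyyy' and 'armyyyyyyyy' into one type.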

In [127]:
# Now let's examine the words of more than 12 characters that occur more than 5 times.
bts_long_frequent_words = [word for word in bts_vocab_nourls if len(word)>12 and fdistbts[word]>5]
sorted(bts_long_frequent_words)

['#4thyearwithbts',
 '#mamaredcarpet',
 '#teamnottoday',
 '#teamspringday',
 '#teamyoungforever',
 'aesthetically',
 'automatically',
 'brendaofthedesert',
 'choreographed',
 'choreographer',
 'choreographies',
 'cinematography',
 'congratulation',
 'congratulations',
 'differentiate',
 'disappointment',
 'disrespectful',
 'entertainment',
 'extraordinary',
 'inappropriate',
 'international',
 'internationally',
 'personalities',
 'pronunciation',
 'recommendation',
 'recommendations',
 'representative',
 'saenggakhamyeon',
 'samkyeobeorin',
 'straightening',
 'thisisneverthat',
 'uncomfortable',
 'understanding',
 'unfortunately',
 'wiheomhajanha']

### Step V: POS tagging

In this step we tag each token with its part of speech (POS), so that we can compute and visualise the frequencies of the most common (proper) nouns, adjectives and verbs, as well as the most common words overall.

In [None]:
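# Download the tagger model and the universal tagset mapping (only needed once).
# Newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead.
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

bts_tagged = nltk.pos_tag(bts_tokenized, tagset="universal")
print(bts_tagged[:50]) # The first 50 (word, tag) pairs.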


In [133]:
#tagged_text = tokenized_text.tagged_words(tagset="universal")
#tag_fd = nltk.FreqDist(tag for (word, tag) in tagged_text)
#print(tag_fd.most_common())

##Universal Part-of-Speech Tagset
##Tag   Meaning              English Examples
##ADJ   adjective            new, good, high, special, big, local
##ADP   adposition           on, of, at, with, by, into, under
##ADV   adverb               really, already, still, early, now
##CONJ  conjunction          and, or, but, if, while, although
##DET   determiner, article  the, a, some, most, every, no, which
##NOUN  noun                 year, home, costs, time, Africa
##NUM   numeral              twenty-four, fourth, 1991, 14:24
##PRT   particle             at, on, out, over per, that, up, with
##PRON  pronoun              he, their, her, its, my, I, us
##VERB  verb                 is, say, told, given, playing, would
##.     punctuation marks    . , ; !
##X     other                ersatz, esprit, dunno, gr8, univeristy

# Adapt the below code to the Kpop dataset!

#TAGGED CORPORA
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories="news", tagset="universal")
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()
tag_fd.plot(cumulative=True)

#Which parts of speech occur before a noun?
word_tag_pairs = nltk.bigrams(brown_news_tagged) #Bigrams consist of word-tag pairs.
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == "NOUN"]
fdist_noun_preceders = nltk.FreqDist(noun_preceders)
fdist_noun_preceders.most_common() #Displays tags and frequencies.
[tag for (tag, _) in fdist_noun_preceders.most_common()] #Just displays tags.

#What are the most common verbs in the Wall Street Journal corpus?
wsj = nltk.corpus.treebank.tagged_words(tagset="universal")
word_tag_fd = nltk.FreqDist(wsj)
word_tag_fd.most_common(50)
#[wordtag[0] for (wordtag, _) in word_tag_fd.most_common() if wordtag[1] == "VERB"] #Sort verbs by frequency.

#Frequency-ordered list of POS tags given a word. Word is treated as a condition and its tag as an event.
cfd1 = nltk.ConditionalFreqDist(wsj)
cfd1["yield"].most_common()

#Reverse the order of the pairs to see likely words for a given POS tag.
#VBN is a Penn Treebank tag, so here we need the corpus with its original (non-universal) tagset.
wsj2 = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj2)
list(cfd2["VBN"])

LookupError: 
**********************************************************************
  Resource 'taggers/averaged_perceptron_tagger/averaged_perceptron
  _tagger.pickle' not found.  Please use the NLTK Downloader to
  obtain the resource:  >>> nltk.download()
  Searched in:
    - 'C:\\Users\\Periwynkle/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\Periwynkle\\Anaconda3\\nltk_data'
    - 'C:\\Users\\Periwynkle\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\Periwynkle\\AppData\\Roaming\\nltk_data'
**********************************************************************
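
As a starting point for adapting the Brown examples to the comment data, the sketch below counts tags and finds the most frequent words within a tag. It assumes the input is a list of (word, tag) pairs like `bts_tagged`. To stay runnable without the tagger model, it uses a small stand-in list whose tags were assigned by hand for illustration.

```python
from collections import Counter, defaultdict

# Hand-tagged stand-in for bts_tagged (tags assigned by hand for illustration).
toy_tagged = [
    ("i", "PRON"), ("love", "VERB"), ("this", "DET"), ("song", "NOUN"),
    ("!", "."), ("bts", "NOUN"), ("save", "VERB"), ("me", "PRON"),
    ("love", "VERB"), ("song", "NOUN"),
]

# How often does each tag occur?
tag_counts = Counter(tag for (word, tag) in toy_tagged)
print(tag_counts.most_common())

# Which words are most frequent within each tag?
words_by_tag = defaultdict(Counter)
for word, tag in toy_tagged:
    words_by_tag[tag][word] += 1

print(words_by_tag["NOUN"].most_common(2))
print(words_by_tag["VERB"].most_common(2))
```

Once tagging succeeds, `toy_tagged` can be swapped for `bts_tagged`. `nltk.ConditionalFreqDist((tag, word) for (word, tag) in bts_tagged)` gives the same per-tag counts in NLTK's own terms.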

### Step VI: Bigrams & Collocations

In [None]:
#Extract a list of word pairs from a text.
from nltk import bigrams
list(bigrams(["more", "is", "said", "than", "done"]))

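#Collocations: bigrams that occur more often than we would expect based on the frequency of the individual words.
#collocations() is a method of nltk.Text, so we wrap the token list first (it also needs the 'stopwords' corpus).
bts_text = nltk.Text(bts_tokenized)
bts_text.collocations()

#Distribution of word lengths in a text.
word_lengths = [len(word) for word in bts_tokenized]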
fdist1wordlength = FreqDist(word_lengths)
print(fdist1wordlength)

#Most common word lengths.
fdist1wordlength.most_common()

#Most frequent word length.
fdist1wordlength.max()

#How many words of length 3 appear in the text.
fdist1wordlength[3]

#What proportion of all words have length 3?
fdist1wordlength.freq(3)
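
The `nltk.collocations` module imported in Step I gives finer control over collocation finding than `Text.collocations()`. Here is a minimal sketch using `BigramCollocationFinder` with a frequency filter and pointwise mutual information (PMI) as the ranking measure. The toy token list stands in for `bts_tokenized`, and the filter threshold of 2 is an arbitrary choice for this small example.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy tokens standing in for bts_tokenized.
toy_tokens = ("save me save me i need your love "
              "save me i love this song so much i love this").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(toy_tokens)
finder.apply_freq_filter(2)  # Ignore bigrams seen fewer than 2 times.

# Rank the remaining bigrams by pointwise mutual information (PMI).
top_bigrams = finder.nbest(bigram_measures.pmi, 3)
print(top_bigrams)
```

On the full comment files a higher threshold is usually sensible, since PMI otherwise favours rare pairs such as typos and usernames.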

#### Open-Ended Exercises/Questions
1. What are the most common 3-grams, 4-grams, and so on?
2. Compare the most frequent words (and types of words) in each of the four video comment datasets.
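
As a possible starting point for the first question, here is a sketch of counting n-grams with `nltk.ngrams` and `FreqDist`. The toy token list stands in for `bts_tokenized`.

```python
from nltk import FreqDist, ngrams

# Toy tokens standing in for bts_tokenized.
toy_tokens = ["save", "me", "save", "me", "i", "need", "your", "love", "save", "me", "i"]

# ngrams() yields tuples of n consecutive tokens.
trigram_counts = FreqDist(ngrams(toy_tokens, 3))
print(trigram_counts.most_common(3))
```

Changing the `3` to `4` gives 4-grams. On the real data it may help to remove punctuation tokens first, since they would otherwise appear in many of the top n-grams.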