# T-725 Natural Language Processing: Lab 2
In today's lab, we will be working with text classification.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

## List comprehensions in Python
List comprehensions are a concise way of creating lists in Python, and take the form:

```python
[expression for item in iterable]
```

A list comprehension creates a new list by evaluating some expression for every item in a given iterable (such as a string, a list or a dictionary). Let's look at an example:

In [1]:
sentence = "In a hole in the ground there lived a hobbit."
words = sentence.split()
print(words)

# Example of a list comprehension
word_lengths = [len(word) for word in words]
print(word_lengths)

# This is equal to
word_lengths = []
for word in words:
  word_lengths.append(len(word))

print(word_lengths)

['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit.']
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]


You can also add a conditional statement to list comprehensions, so that the expression will only be evaluated for items that meet a certain criteria:

In [2]:
e_words = [word for word in words if len(word) > 5]
print(e_words)

['ground', 'hobbit.']


Python also has set and dictionary comprehensions:

In [3]:
lowercase_characters = {c.lower() for c in sentence}
print(lowercase_characters)

word_length = {word: len(word) for word in words}
print(word_length['ground'])

{'v', 'n', 'o', 't', 'd', 'b', 'e', 'u', 'a', 'l', ' ', 'g', 'r', 'h', 'i', '.'}
6


A nested list is a list within another list. You can iterate through nested lists in the following way:

In [4]:
# A list of countries and their capitals within different continents
continents = [
    [('Iceland', 'Reykjavík'), ('Germany', 'Berlin'), ('Spain', 'Madrid')],  # Europe
    [('Japan', 'Tokyo'), ('China', 'Beijing'), ('South Korea', 'Seoul')],  # Asia
    [('Nigeria', 'Abuja'), ('Algeria', 'Algiers'), ('Angola', 'Luanda')]  # Africa
]

# Create a list of all the countries in the previous list
[country for continent in continents for (country, _) in continent]

['Iceland',
 'Germany',
 'Spain',
 'Japan',
 'China',
 'South Korea',
 'Nigeria',
 'Algeria',
 'Angola']

## Sentiment analysis with NLTK
[Chapter 6](https://www.nltk.org/book/ch06.html) of the NLTK book shows how the toolkit can be used to create document classifiers, including a sentiment analyzer. The NLTK includes the `movie_reviews` corpus, which contains 2,000 movie reviews. Half of the reviews have been labelled as **positive** and the other half as **negative**. Let's download it and take a look:

In [5]:
import nltk
from nltk.corpus import movie_reviews
nltk.download('punkt', quiet=True)

nltk.download('movie_reviews', quiet=True)
print("Categories:", movie_reviews.categories())

Categories: ['neg', 'pos']


As expected, there are two categories: `pos` for positive reviews and `neg` for negative reviews. For this particular corpus, each review is stored as a separate text file. To get a list of all the text files in the corpus, we can use `movie_reviews.fileids()`. We can also get a list of files for a specific category:

In [6]:
# IDs of all positive reviews
pos_fileids = movie_reviews.fileids('pos')
# IDs of all negative reviews
neg_fileids = movie_reviews.fileids('neg')

print("positive reviews:", len(pos_fileids))
print("negative reviews:", len(neg_fileids))

print(pos_fileids[:5])  # The first 5 positive reviews
print(neg_fileids[:5])  # The first 5 negative reviews

positive reviews: 1000
negative reviews: 1000
['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt']


We can get a list of all the tokens in the corpus with `movie_reviews.words()`. We can also specify a filename to get a single tokenized review:

In [7]:
pos_reviews = [movie_reviews.words(fid) for fid in pos_fileids]
neg_reviews = [movie_reviews.words(fid) for fid in neg_fileids]

print(pos_reviews[0][:10])  # The first 10 tokens of the first positive review
print(neg_reviews[0][:10])  # The first 10 tokens of the first negative review

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']


Some words, such as *brilliant* and *memorable*, are more strongly associated with positive reviews than negative ones. Similarly, *boring* and *unfunny* have a stronger association with negative reviews.

Using the movie review corpus, we can train a classifier to predict whether a given review is positive or negative. The classifier extracts a set of *features* from every review, which are then used to make the classification. In this case, the features we use will be a dictionary that tells us whether each of the 2,000 most common words in the corpus is present within a review or not.

In [8]:
# Create a set with 2,000 of the most frequent words in the movie review corpus
movie_fd = nltk.FreqDist(movie_reviews.words())
movie_words = {word for word, _ in movie_fd.most_common(2000)}

# For a given review (in the form of a list or set of tokens), create a
# dictionary which tells us which words are present and which are not.
def get_review_features(review):
  review_set = set(review)
  return {word: (word in review_set) for word in movie_words}

In [9]:
# Let's see how this works for the first positive review:
example_features = get_review_features(pos_reviews[0])

print("'funny' is in the review:", example_features['funny'])
print("'boring' is in the review:", example_features['boring'])

'funny' is in the review: True
'boring' is in the review: False


Next, let's create a training set that we can use to train a Naive Bayesian classifier. The training set, in this case, is a list of tuples in the format `[(features, category), ...]`, where `features` is a dictionary from `get_review_features()` and `category` is either `pos` or `neg`, depending on whether the review is positive or negative. To get an idea of how well the classifier performs, we're going to reserve 10% of the reviews for testing. That means that we'll be training our classifier on 1800 examples and testing it on 200 examples.

In [10]:
import random
random.seed(42)

In [11]:
pos_examples = [(get_review_features(review), 'pos') for review in pos_reviews]
neg_examples = [(get_review_features(review), 'neg') for review in neg_reviews]

movie_training = pos_examples[:900] + neg_examples[:900]  # 1800 examples total
movie_test = pos_examples[900:] + neg_examples[900:]  # 200 examples total

Now we have everything we need to train our classifier.

In [12]:
movie_classifier = nltk.NaiveBayesClassifier.train(movie_training)

How well does it perform on the test set?

In [13]:
print(f"Accuracy: {nltk.classify.accuracy(movie_classifier, movie_test)*100.:.2f}%")

Accuracy: 81.50%


The classifier achieves an accuracy of 81.5%. Let's take a look at which words have the biggest weights:

In [14]:
movie_classifier.show_most_informative_features(20)

Most Informative Features
             outstanding = True              pos : neg    =     15.6 : 1.0
                   mulan = True              pos : neg    =      9.0 : 1.0
             wonderfully = True              pos : neg    =      7.1 : 1.0
                  seagal = True              neg : pos    =      7.0 : 1.0
                   damon = True              pos : neg    =      6.1 : 1.0
                   flynt = True              pos : neg    =      5.7 : 1.0
                  wasted = True              neg : pos    =      5.6 : 1.0
                    lame = True              neg : pos    =      5.3 : 1.0
                  poorly = True              neg : pos    =      5.2 : 1.0
                   awful = True              neg : pos    =      4.9 : 1.0
              ridiculous = True              neg : pos    =      4.8 : 1.0
                    jedi = True              pos : neg    =      4.4 : 1.0
                 unfunny = True              neg : pos    =      4.4 : 1.0

# Assignment
Answer the following questions and hand in your solution in Canvas before 8:30 AM Monday September 11th. Remember to save your file before uploading it.

## Question 1
The NLTK also includes a `subjectivity` corpus, which contains a collection of sentences that have either been categorized as **subjective** (emotional, expressing personal feelings and views)  or **objective** (more rational, factual). Some examples:

* **Objective sentences**:
  * uma thurman stars in quentin tarantino's fourth film venture , kill bill .  
  * he lives in a motor garage with his six friends .
  * the ensuing battle was one of the most savage in u . s . history .
* **Subjective sentences**:
  * seagal's strenuous attempt at a change in expression could very well clinch him this year's razzie .
  * de niro cries . you'll cry for your money back .
  * a heroic tale of persistence that is sure to win viewers' hearts .

Unlike the movie review corpus, where every review is stored in separate file, here there is only one file for each category.

Complete the following tasks:
1. Import and download the `subjectivity` corpus.
2. Find the names of each category.
3. Using the category names, get the relative path of each file.
4. Get a list of tokenized sentences for each category (using `subjectivity.sents(fileid)`).

In [15]:
# Your solution here
from nltk.corpus import subjectivity

nltk.download('subjectivity', quiet=True)
print("Categories:", subjectivity.categories())

Categories: ['obj', 'subj']


In [16]:
# IDs of all positive reviews
obj_file = subjectivity.fileids('obj')
subj_file = subjectivity.fileids('subj')

print(obj_file)
print(subj_file)

obj_sents = subjectivity.sents(obj_file)
subj_sents = subjectivity.sents(subj_file)

print(obj_sents)
print(subj_sents)

['plot.tok.gt9.5000']
['quote.tok.gt9.5000']
[['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ...]
[['smart', 'and', 'alert', ',', 'thirteen', 'conversations', 'about', 'one', 'thing', 'is', 'a', 'small', 'gem', '.'], ['color', ',', 'musical', 'bounce', 'and', 'warm', 'seas', 'lapping', 'on', 'island', 'shores', '.', 'and', 'just', 'enough', 'science', 'to', 'send', 'you', 'home', 'thinking', '.'], ...]


## Question 2
Complete the following tasks:
1. Create a set with the 2,000 most common words in the `subjectivity` corpus using `nltk.FreqDist()`.
2. Create a function that takes a single, tokenized sentence as input (e.g., `['the', 'ensuing', 'battle', ...]`), and returns a dictionary of the 2,000 most frequent words and whether or not they are in the sentence (e.g., `{'battle': True, 'amusing': False, ...}`).

In [17]:
# Your solution here
subjectivity_fd = nltk.FreqDist(subjectivity.words())
subject_words = {word for word, _ in subjectivity_fd.most_common(2000)}

print(subject_words)

{'friendship', 'much', 'parts', 'horrible', 'knowing', 'six', 'sports', 'seven', 'free', 'changes', 'cause', 'home', 'ideas', 'assassin', 'cross', 'independent', 'against', 'enough', 'enterprise', 'tried', 'entire', 'mental', 'was', 'wealthy', 'epic', 'coming-of-age', 'very', 'old', 'spell', 'plenty', 'originality', 'affair', 'sounds', 'blue', 'faced', 'birthday', 'unfortunately', 'flick', 'truth', 'inspired', 'say', 'killed', 'screenplay', 'makes', 'shows', 'clear', 'hunt', 'year', 'water', 'upon', 'violence', 'bright', 'had', 'what', "they're", 'feels', 'age', 'caught', 'pulls', 'musical', 'to', 'levels', 'they', 'human', 'knows', 'up', 'character', 'fire', '2001', 'steven', 'bed', 'incredible', 'personality', 'due', '?', 'teenagers', 'chilling', 'force', 'lawyer', 'patricia', 'joy', 'guilt', 'again', 'pictures', 'opportunity', 'surprise', 'travels', 'share', 'kong', "'the", 'returns', 'too', 'filmmakers', 'best', 'arts', 'every', 'meanwhile', 'clever', 'going', 'drive', 'confront', 

In [18]:
def get_features(tokens, most_common_tokens):
  tokens_set = set(tokens)
  return {word: (word in tokens_set) for word in most_common_tokens}

## Question 3
Complete the following tasks:
1. Create a training set with 9,000 sentences (4,500 of each category)
2. Create a test set with 1,000 sentences (500 of each category)

In [19]:
# Your solution here
import random
random.seed(42)

def split_train_test(examples, train_fraction=0.8, train_count=-1):
    # Shuffle the examples
    random.shuffle(examples)

    if train_count < 0:
        train_count = int(len(examples) * train_fraction)
    else:
        assert (train_count < len(examples)), f"Tried to extract too many training samples! ({train_count} out of {len(examples)})"

    # Split the examples
    train_set = examples[:train_count]
    test_set = examples[train_count:]

    return train_set, test_set

In [20]:
obj_features = [(get_features(sent, subject_words), "obj") for sent in obj_sents]
subj_features = [(get_features(sent, subject_words), "subj") for sent in subj_sents]

train, test = split_train_test(obj_features + subj_features, train_fraction=0.9)

print(len(train))
print(len(test))

9000
1000


## Question 4
Complete the following tasks:
1. Train a Naive Bayes classifier using the training set from the previous question.
2. Evaluate the classifier on the test set. How accurate is it?
3. Find the 20 most informative features.

In [21]:
# Your solution here
classif_nb = nltk.NaiveBayesClassifier
classif_nb = classif_nb.train(train)

In [22]:
print(f"Accuracy: {nltk.classify.accuracy(classif_nb, test)*100.:.2f}%")

Accuracy: 89.40%


In [23]:
classif_nb.show_most_informative_features(20)

Most Informative Features
                      -- = True             subj : obj    =     88.8 : 1.0
             fascinating = True             subj : obj    =     29.6 : 1.0
                discover = True              obj : subj   =     28.4 : 1.0
                    fans = True             subj : obj    =     26.9 : 1.0
                    town = True              obj : subj   =     26.7 : 1.0
            entertaining = True             subj : obj    =     26.1 : 1.0
                  film's = True             subj : obj    =     25.7 : 1.0
             interesting = True             subj : obj    =     25.7 : 1.0
              girlfriend = True              obj : subj   =     24.4 : 1.0
                    plan = True              obj : subj   =     21.9 : 1.0
               detective = True              obj : subj   =     21.7 : 1.0
                daughter = True              obj : subj   =     21.1 : 1.0
                 mission = True              obj : subj   =     21.1 : 1.0

# Question 5
Dialog acts are sort of the type of *action* performed by the speaker. In the instant messaging corpus dataset 'NPS', each utterance is labeled with one of 15 dialogue act types, such as **Statement**, **Emotion**, **ynQuestion**, **Continuer**, etc.

Your task is to classify text from the NPS corpus into two dialog acts: **whQuestion** or **Emotion**.

Start by downloading the NPS corpus and getting all posts from the corpus:

In [24]:
from nltk.corpus import nps_chat

nltk.download('nps_chat', quiet=True)
posts_raw = nps_chat.xml_posts()

Create a list that only includes posts of class **Emotion** and **whQuestion**. You can access the class of a post by calling `post.get("class")`.

In [25]:
# Your solution here
print(f"Total count: {len(posts_raw)}")

posts = [p for p in posts_raw if p.get('class') in ['Emotion', 'whQuestion']]

print(f"Selected posts: {len(posts)}")

Total count: 10567
Selected posts: 1639


In [26]:
import re

def filter_post_text(post):
    tmp = re.sub(r"[\(\)]{2,}", "", post.text)
    tmp = re.sub(r"(\d\d\-){2,3}", "", tmp)

    return tmp #post.text.replace("(", "")

Randomize the posts and create a training set and a test set, where the first 1300 **Emotion + whQuestion** posts are used for training and the rest for testing.

In [27]:
# Your solution here
posts_em = [filter_post_text(p).split(" ") for p in posts if p.get('class') == 'Emotion']
posts_whq = [filter_post_text(p).split(" ") for p in posts if p.get('class') == 'whQuestion']

posts_train, posts_test = split_train_test(posts_em + posts_whq, train_count=1300)

Create a list of the 200 most frequent tokens in the training set. You can access the text of a `post` object by calling `post.text`. Remember that the **split** function will use whitespace to tokenize a string: `some_string.split()`

In [28]:
# Your solution here
train_texts = [word for p in posts_train for word in p]
posts_train_fd = nltk.FreqDist(train_texts)
posts_train_words = {word for word, _ in posts_train_fd.most_common(2000)}

print(posts_train_words)

{'', 'much', 'quiet.', 'laffs', 'forever...the', 'Lol', 'home', 'omggg', 'friendly', 'adultsUser29!!!', 'holdin', '20sUser121!!!', 'was', 'Lmao.', 'old', 'ooer', 'blue', 'ass', 'birthday', 'say', '30sUser11,', '20sUser87', '40sUser6', 'stranger', 'opps', 'jesus', 'coupons', 'eeewwwwww', 'Marlaya<333333333!!!!!', 'what', '20sUser83', 'to', ':tongue:', 'battery', 'wut', 'adultsUser35..WTF', 'they', 'adultsUser41?', 'up', 'mmmm', '?', 'o.O', 'joy', 'wee', 'pink,', 'chick', '...', 'pervs', 'island?', 'tonight?', 'too', 'going', 'white', 'wha?', '40sUser25', 'adultsUser53?', 'U', 'to?', 'DAMN', 'job', 'yikesssss', 'wrong', 'o0', 'lord', 'hahahah', 'awhile', 'That', 'feeling', 'him', 'honey.....how', '"pussy', 'way,', 'Whereabouts', 'kold....lol', 'DOES', '40sUser21', 'adultsUser0?', 'Lol.', 'rotflmao', 'dman', 'happen?', 'pop', '40sUser7..i', ':|', 'burns', "you....I'm", 'yall', 'whose', 'me', 'adultsUser19?', 'mike', 'middle', '20sUser6?', 'song', 'Ships?', '40sUser9', 'pissed', ':o', 'mwa


Define two feature selection functions that take a string as input and output a dictionary of features:
* `get_word_features(string)`
* `get_custom_features(string)`

Begin by defining `get_word_features`. This function should use the words as features, just like in the movie review example above.




In [29]:
# Your solution here
def get_word_features(post):
    return get_features(post, posts_train_words)

Next, define `get_custom_features`. This function should extract the features from the text that characterize the **Emotion** and **whQuestions** classes.

In [45]:
# Your solution here
def get_custom_features(post):
    features = {}

    post = " ".join(post).lower()

    features['has_lol'] = len(re.findall(r"l(o+)l", post))
    features['has_lmao'] = len(re.findall(r"lma(o+)", post))
    features['has_laugh'] = len(re.findall(r"(h[aeio])+", post))
    features['has_emoji'] = len(re.findall(r"[=:]([\(\)/\\P3])+", post))
    features['has_qmark'] = len(re.findall(r"\?+$", post))
    features['has_exmark'] = len(re.findall(r"!+$", post))
    features['has_wh_word'] = re.search('(what|where|when|who|why|wtf)', post) is not None

    return features

Conduct the following tasks:
*   Train two Naive Bayes classifiers on the **Emotion + whQuestions** training set: one that uses the `get_word_features` function and another using `get_custom_features`.
*   Evaluate each classifier on the test set. How accurate are they? Which one is better?
*   What are the 20 most informative features for each classifier?


In [31]:
# Your solution here
posts_em_wordf = [(get_word_features(post), 'Emotion') for post in posts_em]
posts_whq_wordf = [(get_word_features(post), 'whQuestion') for post in posts_em]

posts_train, posts_test = split_train_test(posts_em_wordf + posts_whq_wordf, train_count=1300)

classif_posts_words = nltk.NaiveBayesClassifier.train(posts_train)
classif_posts_words.show_most_informative_features(20)
print(f"Accuracy: {nltk.classify.accuracy(classif_posts_words, posts_test)*100.:.2f}%")

Most Informative Features
            adultsUser35 = True           whQues : Emotio =      2.9 : 1.0
                     you = True           whQues : Emotio =      2.9 : 1.0
            adultsUser53 = True           Emotio : whQues =      2.4 : 1.0
                 giggles = True           Emotio : whQues =      2.4 : 1.0
                      it = True           Emotio : whQues =      2.4 : 1.0
              20sUser121 = True           whQues : Emotio =      2.3 : 1.0
                     :-o = True           whQues : Emotio =      2.3 : 1.0
                      =( = True           whQues : Emotio =      2.3 : 1.0
            adultsUser16 = True           whQues : Emotio =      2.3 : 1.0
                    hugs = True           whQues : Emotio =      2.3 : 1.0
                     o.O = True           whQues : Emotio =      2.3 : 1.0
                20sUser7 = True           Emotio : whQues =      2.2 : 1.0
               40sUser31 = True           Emotio : whQues =      1.8 : 1.0

In [46]:
posts_em_customf = [(get_custom_features(post), 'Emotion') for post in posts_em]
posts_whq_customf = [(get_custom_features(post), 'whQuestion') for post in posts_em]

posts_train, posts_test = split_train_test(posts_em_customf + posts_whq_customf, train_count=1300)

classif_posts_words = nltk.NaiveBayesClassifier.train(posts_train)
classif_posts_words.show_most_informative_features(20)
print(f"Accuracy: {nltk.classify.accuracy(classif_posts_words, posts_test)*100.:.2f}%")

Most Informative Features
               has_laugh = 3              whQues : Emotio =      1.6 : 1.0
               has_emoji = 1              Emotio : whQues =      1.3 : 1.0
              has_exmark = 1              whQues : Emotio =      1.1 : 1.0
               has_laugh = 2              whQues : Emotio =      1.1 : 1.0
                has_lmao = 1              Emotio : whQues =      1.1 : 1.0
             has_wh_word = True           Emotio : whQues =      1.1 : 1.0
                 has_lol = 2              Emotio : whQues =      1.1 : 1.0
               has_laugh = 4              Emotio : whQues =      1.1 : 1.0
               has_laugh = 1              Emotio : whQues =      1.1 : 1.0
                 has_lol = 1              whQues : Emotio =      1.0 : 1.0
                 has_lol = 0              Emotio : whQues =      1.0 : 1.0
               has_emoji = 0              whQues : Emotio =      1.0 : 1.0
                has_lmao = 0              whQues : Emotio =      1.0 : 1.0