# T-725 Natural Language Processing: Lab 2
In today's lab, we will be working with text classification.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create your own copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

## List comprehensions in Python
List comprehensions are a concise way of creating lists in Python, and take the form:

```python
[expression for item in iterable]
```

A list comprehension creates a new list by evaluating some expression for every item in a given iterable (such as a string, a list or a dictionary). Let's look at an example:

In [None]:
sentence = "In a hole in the ground there lived a hobbit."
words = sentence.split()
print(words)

# Example of a list comprehension
word_lengths = [len(word) for word in words]
print(word_lengths)

# This is equivalent to
word_lengths = []
for word in words:
  word_lengths.append(len(word))

print(word_lengths)

['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit.']
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]


You can also add a condition to a list comprehension, so that the expression is only evaluated for items that meet a certain criterion:

In [None]:
# Keep only the words that are longer than 5 characters
long_words = [word for word in words if len(word) > 5]
print(long_words)

['ground', 'hobbit.']
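
A trailing `if` filters items, which is different from a conditional expression placed at the front of the comprehension. The following minimal sketch contrasts the two; the `labels` list is only an illustration and is not used elsewhere in the lab.

```python
words = "In a hole in the ground there lived a hobbit.".split()

# Filter: the trailing `if` drops every item that fails the test
long_words = [word for word in words if len(word) > 5]

# Conditional expression: every item is kept, but its value depends on the test
labels = ["long" if len(word) > 5 else "short" for word in words]

print(long_words)
print(labels)
```

The first list is shorter than `words`, while the second always has the same length as `words`.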


Python also has set and dictionary comprehensions:

In [None]:
lowercase_characters = {c.lower() for c in sentence}
print(lowercase_characters)

word_length = {word: len(word) for word in words}
print(word_length['ground'])

{' ', '.', 'h', 'b', 't', 'a', 'd', 'n', 'i', 'o', 'l', 'u', 'e', 'r', 'v', 'g'}
6


A nested list is a list within another list. You can iterate through nested lists in the following way:

In [None]:
# A list of countries and their capitals within different continents
continents = [
    [('Iceland', 'Reykjavík'), ('Germany', 'Berlin'), ('Spain', 'Madrid')],  # Europe
    [('Japan', 'Tokyo'), ('China', 'Beijing'), ('South Korea', 'Seoul')],  # Asia
    [('Nigeria', 'Abuja'), ('Algeria', 'Algiers'), ('Angola', 'Luanda')]  # Africa
]

# Create a list of all the countries in the previous list
[country for continent in continents for (country, capital) in continent]

['Iceland',
 'Germany',
 'Spain',
 'Japan',
 'China',
 'South Korea',
 'Nigeria',
 'Algeria',
 'Angola']
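
In a nested comprehension, the `for` clauses are read from left to right, with the outermost loop first. As a small sketch on a shortened, stand-in version of the data above, the comprehension and the explicit nested loops should produce the same list:

```python
# Shortened stand-in for the `continents` list above
continents = [
    [('Iceland', 'Reykjavík'), ('Germany', 'Berlin')],  # Europe
    [('Japan', 'Tokyo'), ('China', 'Beijing')],  # Asia
]

# Nested comprehension: outer loop first, inner loop second
countries = [country for continent in continents for (country, capital) in continent]

# Equivalent nested for loops
countries_loop = []
for continent in continents:
    for (country, capital) in continent:
        countries_loop.append(country)

print(countries == countries_loop)
```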

## Sentiment analysis with NLTK
[Chapter 6](https://www.nltk.org/book/ch06.html) of the NLTK book shows how the toolkit can be used to create document classifiers, including a sentiment analyzer. The NLTK includes the `movie_reviews` corpus, which contains 2,000 movie reviews. Half of the reviews have been labelled as **positive** and the other half as **negative**. Let's download it and take a look:

In [None]:
import nltk
from nltk.corpus import movie_reviews
nltk.download('punkt_tab')

nltk.download('movie_reviews')
print("Categories:", movie_reviews.categories())

Categories: ['neg', 'pos']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


As expected, there are two categories: `pos` for positive reviews and `neg` for negative reviews. For this particular corpus, each review is stored as a separate text file. To get a list of all the text files in the corpus, we can use `movie_reviews.fileids()`. We can also get a list of files for a specific category:

In [None]:
pos_fileids = movie_reviews.fileids('pos')
neg_fileids = movie_reviews.fileids('neg')

print(pos_fileids[:5])  # The first 5 positive reviews
print(neg_fileids[:5])  # The first 5 negative reviews

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt']


We can get a list of all the tokens in the corpus with `movie_reviews.words()`. We can also specify a filename to get a single tokenized review:

In [None]:
pos_reviews = [movie_reviews.words(fid) for fid in pos_fileids]
neg_reviews = [movie_reviews.words(fid) for fid in neg_fileids]

print(pos_reviews[0][:10])  # The first 10 tokens of the first positive review
print(neg_reviews[0][:10])  # The first 10 tokens of the first negative review

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']


Some words, such as *brilliant* and *memorable*, are more strongly associated with positive reviews than negative ones. Similarly, *boring* and *unfunny* have a stronger association with negative reviews.

Using the movie review corpus, we can train a classifier to predict whether a given review is positive or negative. The classifier extracts a set of *features* from every review, which are then used to make the classification. In this case, the features we use will be a dictionary that tells us whether each of the 2,000 most common words in the corpus is present within a review or not.

In [None]:
# Create a set with 2,000 of the most frequent words in the movie review corpus
movie_fd = nltk.FreqDist(movie_reviews.words())
movie_words = {word for word, count in movie_fd.most_common(2000)}

# For a given review (in the form of a list or set of tokens), create a
# dictionary which tells us which words are present and which are not.
def get_review_features(review):
  review_words = set(review)
  return {word: word in review_words for word in movie_words}


In [None]:
# Let's see how this works for the first positive review:
example_features = get_review_features(pos_reviews[0])
print("'funny' is in the review:", example_features['funny'])
print("'boring' is in the review:", example_features['boring'])

'funny' is in the review: True
'boring' is in the review: False
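
To see the feature extraction on a scale that is easy to inspect, here is a minimal sketch on a made-up toy corpus. It uses `collections.Counter` from the standard library in place of `nltk.FreqDist` (which is a subclass of `Counter`); the reviews, the vocabulary size of 4 and the name `get_features` are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each review is a list of tokens
reviews = [
    ["a", "funny", "and", "memorable", "film"],
    ["a", "boring", "film"],
    ["funny", "but", "boring"],
]

# Count every token in the corpus and keep the 4 most common as the vocabulary
counts = Counter(token for review in reviews for token in review)
vocabulary = {word for word, count in counts.most_common(4)}

def get_features(review, vocabulary):
    # One True/False entry per vocabulary word, whatever the review length
    review_words = set(review)
    return {word: word in review_words for word in vocabulary}

features = get_features(["a", "funny", "film"], vocabulary)
print(features)
```

Every review is mapped to a dictionary with the same keys, which is what allows the classifier to compare reviews of different lengths.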


Next, let's create a training set that we can use to train a Naive Bayes classifier. The training set, in this case, is a list of tuples in the format `[(features, category), ...]`, where `features` is a dictionary from `get_review_features()` and `category` is either `pos` or `neg`, depending on whether the review is positive or negative. To get an idea of how well the classifier performs, we're going to reserve 10% of the reviews for testing. That means that we'll be training our classifier on 1,800 examples and testing it on 200 examples.

In [None]:
pos_examples = [(get_review_features(review), 'pos') for review in pos_reviews]
neg_examples = [(get_review_features(review), 'neg') for review in neg_reviews]

movie_training = pos_examples[:900] + neg_examples[:900]  # 1800 examples total
movie_test = pos_examples[900:] + neg_examples[900:]  # 200 examples total

Now we have everything we need to train our classifier.

In [None]:
movie_classifier = nltk.NaiveBayesClassifier.train(movie_training)

How well does it perform on the test set?

In [None]:
print("Accuracy:", nltk.classify.accuracy(movie_classifier, movie_test))

Accuracy: 0.815
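
Accuracy is the fraction of test examples whose predicted label matches the true label. With 200 test examples, an accuracy of 0.815 corresponds to 163 correct predictions. The sketch below computes the same quantity by hand; `accuracy`, `toy_predict` and `toy_test` are hypothetical stand-ins and not part of NLTK.

```python
def accuracy(predict, labeled_examples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

# Hypothetical stand-in for a trained classifier
def toy_predict(features):
    return 'pos' if features.get('funny') else 'neg'

toy_test = [
    ({'funny': True}, 'pos'),   # predicted pos, correct
    ({'funny': False}, 'neg'),  # predicted neg, correct
    ({'funny': True}, 'neg'),   # predicted pos, wrong
    ({'funny': False}, 'neg'),  # predicted neg, correct
]

print("Accuracy:", accuracy(toy_predict, toy_test))
```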


The classifier achieves an accuracy of 81.5%. Let's take a look at which words are the most informative, that is, which words most strongly favor one category over the other:

In [None]:
movie_classifier.show_most_informative_features(20)

Most Informative Features
             outstanding = True              pos : neg    =     15.6 : 1.0
                   mulan = True              pos : neg    =      9.0 : 1.0
             wonderfully = True              pos : neg    =      7.1 : 1.0
                  seagal = True              neg : pos    =      7.0 : 1.0
                   damon = True              pos : neg    =      6.1 : 1.0
                   flynt = True              pos : neg    =      5.7 : 1.0
                  wasted = True              neg : pos    =      5.6 : 1.0
                    lame = True              neg : pos    =      5.3 : 1.0
                  poorly = True              neg : pos    =      5.2 : 1.0
                   awful = True              neg : pos    =      4.9 : 1.0
              ridiculous = True              neg : pos    =      4.8 : 1.0
                    jedi = True              pos : neg    =      4.4 : 1.0
                 unfunny = True              neg : pos    =      4.4 : 1.0
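
Each row can be read as a likelihood ratio: `outstanding = True   pos : neg = 15.6 : 1.0` says that the feature value is about 15.6 times more likely among the positive training reviews than among the negative ones. The following is a simplified sketch of such a ratio on made-up data. The function `presence_ratio`, the toy documents and the smoothing constant of 0.5 are assumptions for illustration, and NLTK's exact estimator may differ.

```python
def presence_ratio(word, docs_a, docs_b, smoothing=0.5):
    """Smoothed ratio P(word present | class a) / P(word present | class b)."""
    def p_present(docs):
        present = sum(1 for doc in docs if word in doc)
        # The feature has two possible values (True/False), hence 2 * smoothing
        return (present + smoothing) / (len(docs) + 2 * smoothing)
    return p_present(docs_a) / p_present(docs_b)

# Hypothetical toy documents, each represented as a set of tokens
pos_docs = [{"outstanding", "film"}, {"outstanding"}, {"good"}, {"fine"}]
neg_docs = [{"lame"}, {"awful"}, {"bad"}, {"film"}]

print(round(presence_ratio("outstanding", pos_docs, neg_docs), 2))
print(round(presence_ratio("film", pos_docs, neg_docs), 2))
```

A word that appears equally often in both categories gets a ratio close to 1 and is therefore not informative. The smoothing keeps the ratio finite when a word never occurs in one of the categories.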

# Assignment
Answer the following questions and hand in your solution in Canvas before 23:59 on September 5th. Remember to save your file before uploading it.

## Question 1
The NLTK also includes a `subjectivity` corpus, which contains a collection of sentences that have been categorized as either **subjective** (emotional, expressing personal feelings and views) or **objective** (more rational, factual). Some examples:

* **Objective sentences**:
  * uma thurman stars in quentin tarantino's fourth film venture , kill bill .  
  * he lives in a motor garage with his six friends .
  * the ensuing battle was one of the most savage in u . s . history .
* **Subjective sentences**:
  * seagal's strenuous attempt at a change in expression could very well clinch him this year's razzie .
  * de niro cries . you'll cry for your money back .
  * a heroic tale of persistence that is sure to win viewers' hearts .

Unlike the movie review corpus, where every review is stored in a separate file, here there is only one file for each category.

Complete the following tasks:
1. Import and download the `subjectivity` corpus.
2. Find the name of each category.
3. Using the category names, get the relative path of each file.
4. Get a list of tokenized sentences for each category (using `subjectivity.sents(fileid)`).

In [None]:
from nltk.corpus import subjectivity

nltk.download("subjectivity")
print("Categories: ", subjectivity.categories())

obj_fileid = subjectivity.fileids('obj')
subj_fileid = subjectivity.fileids('subj')

obj_tokenized_sents = subjectivity.sents(obj_fileid)
subj_tokenized_sents = subjectivity.sents(subj_fileid)

print("Objectives: ", obj_tokenized_sents[:5])
print("Subjective: ", subj_tokenized_sents[:5])


Categories:  ['obj', 'subj']
Objectives:  [['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ['spurning', 'her', "mother's", 'insistence', 'that', 'she', 'get', 'on', 'with', 'her', 'life', ',', 'mary', 'is', 'thrown', 'out', 'of', 'the', 'house', ',', 'rejected', 'by', 'joe', ',', 'and', 'expelled', 'from', 'school', 'as', 'she', 'grows', 'larger', 'with', 'child', '.'], ['amitabh', "can't", 'believe', 'the', 'board', 'of', 'directors', 'and', 'his', 'mind', 'is', 'filled', 'wit

[nltk_data] Downloading package subjectivity to /root/nltk_data...
[nltk_data]   Package subjectivity is already up-to-date!


## Question 2
Complete the following tasks:
1. Create a set with the 2,000 most common words in the `subjectivity` corpus using `nltk.FreqDist()`.
2. Create a function that takes a single, tokenized sentence as input (e.g., `['the', 'ensuing', 'battle', ...]`), and returns a dictionary of the 2,000 most frequent words and whether or not they are in the sentence (e.g., `{'battle': True, 'amusing': False, ...}`).

In [None]:
words = subjectivity.words()
common_tokens = nltk.FreqDist(words).most_common(2000)  # list of (word, count) tuples
common_tokens = {word for (word, count) in common_tokens}  # keep only the words, as a set

def frequent_words(sentence):
  set_sentence = set(sentence)
  return {word: word in set_sentence for word in common_tokens}

## Question 3
Complete the following tasks:
1. Create a training set with 9,000 sentences (4,500 of each category)
2. Create a test set with 1,000 sentences (500 of each category)

In [None]:
obj_samples = [(frequent_words(sent), "obj") for sent in obj_tokenized_sents]
subj_samples = [(frequent_words(sent), "subj") for sent in subj_tokenized_sents]

train_sample = obj_samples[:4500] + subj_samples[:4500]
test_sample = obj_samples[4500:] + subj_samples[4500:]

## Question 4
Complete the following tasks:
1. Train a Naive Bayes classifier using the training set from the previous question.
2. Evaluate the classifier on the test set. How accurate is it?
3. Find the 20 most informative features.

In [None]:
sentences_classifier = nltk.NaiveBayesClassifier.train(train_sample)
print("Accuracy:", nltk.classify.accuracy(sentences_classifier, test_sample))
sentences_classifier.show_most_informative_features(20)

Accuracy: 0.906
Most Informative Features
                      -- = True             subj : obj    =     70.1 : 1.0
                   order = True              obj : subj   =     39.0 : 1.0
                 decides = True              obj : subj   =     35.7 : 1.0
                  sister = True              obj : subj   =     27.7 : 1.0
            entertaining = True             subj : obj    =     26.6 : 1.0
              girlfriend = True              obj : subj   =     26.3 : 1.0
                discover = True              obj : subj   =     25.0 : 1.0
                  film's = True             subj : obj    =     25.0 : 1.0
                  you're = True             subj : obj    =     22.6 : 1.0
                daughter = True              obj : subj   =     22.4 : 1.0
                 married = True              obj : subj   =     21.7 : 1.0
                 amusing = True             subj : obj    =     19.7 : 1.0
                   plans = True              obj : subj   

## Question 5
A dialogue act is the type of *action* that a speaker performs with an utterance. In the NPS Chat corpus of instant messaging posts, each utterance is labeled with one of 15 dialogue act types, such as **Statement**, **Emotion**, **ynQuestion** and **Continuer**.

Your task is to classify posts from the NPS corpus into one of two dialogue acts: **whQuestion** or **Emotion**.

Start by downloading the NPS corpus and getting all posts from the corpus:

In [None]:
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()

[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!


Create a list that only includes posts of class **Emotion** and **whQuestion**. You can access the class of a post by calling `post.get("class")`.

In [None]:
emotions_posts = [post for post in posts if post.get("class") == "Emotion"]
whQuestions_posts = [post for post in posts if post.get("class") == "whQuestion"]
needed_posts = emotions_posts + whQuestions_posts



Randomize the posts and create a training set and a test set, where the first 1300 **Emotion + whQuestion** posts are used for training and the rest for testing.

In [None]:
import random
random.shuffle(needed_posts)

train_set = needed_posts[:1300]
test_set = needed_posts[1300:]
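
Because `random.shuffle` gives a different order on every run, the reported accuracy will vary slightly between runs. If reproducible results are wanted, one option is to shuffle with a seeded generator. This is a sketch on stand-in data; the seed value 725 and the split point 7 are arbitrary choices made for illustration.

```python
import random

items = list(range(10))  # stand-in for the list of posts

rng = random.Random(725)  # seeded generator: same order on every run
shuffled = items[:]       # copy, so the original list keeps its order
rng.shuffle(shuffled)

# Split into a training part and a test part, as in the cell above
train, test = shuffled[:7], shuffled[7:]
print(len(train), len(test))
```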


Create a list of the 200 most frequent tokens in the training set. You can access the text of a `post` object by calling `post.text`. Remember that the **split** function will use whitespace to tokenize a string: `some_string.split()`

In [None]:
tokenized_posts = [post.text.lower().split() for post in train_set]
all_posts_words = [word for post in tokenized_posts for word in post]
frequent_tokens = nltk.FreqDist(all_posts_words).most_common(200)
frequent_tokens = [word for word, count in frequent_tokens]
print(frequent_tokens)


['lol', 'you', 'what', 'how', 'lmao', 'the', 'are', 'who', 'is', 'to', 'u', 'haha', 'whats', 'and', 'why', ':)', 'where', 'in', 'up', 'did', 'do', 'i', 'a', 'that', 'it', 'omg', 'so', 'about', 'was', 'for', 'of', 'here', '?', 'oh', ';)', ';-)', 'all', 'your', 'damn', '11-09-40suser18', 'r', 'everyone', 'ok', 'on', 'good', 'me', 'hahaha', '11-08-adultsuser65', ':-)', 'with', "what's", ':p', 'when', 'but', 'you?', 'from', 'my', 'have', 'come', 'what?', 'doing', '10-19-40suser9', 'hi', 'ya', 'wants', 'from?', '10-19-adultsuser35', "who's", '@', '10-19-adultsuser23', 'or', 'been', 'like', 'her', 'ha', 'girl', '10-19-30suser11', 'at', '11-08-20suser21', '10-19-40suser3', '10-19-20suser7', 'o.o', ':tongue:', 'he', '???', 'hows', 'there', '..', 'which', '10-19-adultsuser28', 'happened', '11-09-40suser7', 'we', 'doin', 'yes', 'hahah', '10-19-20suser121', 'lmfao', 'name', 'does', 'get', 'laffs', '10-19-30suser31', '10-19-adultsuser47', ':-o', 'know', '11-06-adultsuser105', '((((((((((', 'can', 
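
Splitting on whitespace leaves punctuation attached to the neighboring word, which is why tokens such as `'you?'` and `'what?'` appear in the list as separate entries from `'you'` and `'what'`. The sketch below shows the effect on a few made-up posts, using `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist`.

```python
from collections import Counter

texts = ["What are you doing?", "lol", "LOL what?"]  # hypothetical posts

# Lowercase each post, split it on whitespace and flatten into one token list
tokens = [token for text in texts for token in text.lower().split()]
counts = Counter(tokens)

print(tokens)
print(counts.most_common(1))
```

Here `'what'` and `'what?'` are counted as two different tokens, while `'lol'` and `'LOL'` are merged by the lowercasing step.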


Define two feature selection functions that take a string as input and output a dictionary of features:
* `get_word_features(string)`
* `get_custom_features(string)`

Begin by defining `get_word_features`. This function should use the words as features, just like in the movie review example above.




In [None]:
def get_word_features(string):
  tokenized_string = set(string.lower().split())
  return {word: word in tokenized_string for word in frequent_tokens}


Next, define `get_custom_features`. This function should extract features from the text that characterize the **Emotion** and **whQuestion** classes.

In [None]:
# Return the `count` most frequent tokens among the posts of a specific class in a post list
def common_tokens_by_class(post_list, post_class, count):
  class_posts = [post for post in post_list if post.get("class") == post_class]
  tokenized_list = [post.text.lower().split() for post in class_posts]
  all_class_tokens = [word for post in tokenized_list for word in post]
  return {word for word,_ in nltk.FreqDist(all_class_tokens).most_common(count)}

# Find the most common tokens for Emotion and whQuestion and join them together
emotion_common_tokens = common_tokens_by_class(posts, "Emotion", 30)
whquestion_common_token = common_tokens_by_class(posts, "whQuestion", 30)
custom_common_tokens = emotion_common_tokens | whquestion_common_token

def get_custom_features(string):
  tokens_list = set(string.lower().split())
  return {word: word in tokens_list for word in custom_common_tokens}

get_custom_features("this is a sentence lol haha")

{'about': False,
 'hehe': False,
 'what': False,
 'of': False,
 '11-09-40suser31': False,
 '10-19-40suser9': False,
 'who': False,
 'a': True,
 ';)': False,
 'did': False,
 'how': False,
 'you': False,
 'where': False,
 'it': False,
 ':tongue:': False,
 ';-)': False,
 'oh': False,
 'hahah': False,
 '10-19-40suser3': False,
 'for': False,
 'why': False,
 'the': False,
 '?': False,
 'lmao': False,
 'up': False,
 'haha': True,
 ':)': False,
 '11-08-adultsuser65': False,
 '11-08-20suser21': False,
 'r': False,
 'is': True,
 '10-19-adultsuser47': False,
 'hahaha': False,
 'ha': False,
 'damn': False,
 'omg': False,
 'good': False,
 'her': False,
 'that': False,
 ':-o': False,
 'i': False,
 'in': False,
 ':p': False,
 'u': False,
 '10-19-20suser7': False,
 'are': False,
 ':-)': False,
 'so': False,
 'your': False,
 'whats': False,
 '@': False,
 'was': False,
 'here': False,
 'to': False,
 'and': False,
 '((((((((((': False,
 '11-09-40suser18': False,
 'do': False,
 'lol': True}

Complete the following tasks:
*   Train two Naive Bayes classifiers on the **Emotion + whQuestion** training set: one that uses the `get_word_features` function and another using `get_custom_features`.
*   Evaluate each classifier on the test set. How accurate are they? Which one is better?
*   What are the 20 most informative features for each classifier?


In [None]:
# Build a list of (features, label) pairs using the given feature function.
# train_set and test_set only contain Emotion and whQuestion posts, so no
# further filtering by class is needed.
def label_posts(post_list, feature_function):
  return [(feature_function(post.text), post.get("class")) for post in post_list]

# Classifier trained with the get_word_features function
labeled_train = label_posts(train_set, get_word_features)
labeled_test = label_posts(test_set, get_word_features)

first_classification = nltk.NaiveBayesClassifier.train(labeled_train)
print("Accuracy:", nltk.classify.accuracy(first_classification, labeled_test))
first_classification.show_most_informative_features(20)

# Classifier trained with the get_custom_features function
labeled_train = label_posts(train_set, get_custom_features)
labeled_test = label_posts(test_set, get_custom_features)

second_classification = nltk.NaiveBayesClassifier.train(labeled_train)
print("Accuracy:", nltk.classify.accuracy(second_classification, labeled_test))
second_classification.show_most_informative_features(20)

Accuracy: 0.9321533923303835
Most Informative Features
                     how = True           whQues : Emotio =     94.6 : 1.0
                       u = True           whQues : Emotio =     50.4 : 1.0
                     you = True           whQues : Emotio =     34.7 : 1.0
                     and = True           whQues : Emotio =     32.4 : 1.0
                     the = True           whQues : Emotio =     29.9 : 1.0
                      in = True           whQues : Emotio =     26.9 : 1.0
                     lol = True           Emotio : whQues =     24.3 : 1.0
                    that = True           whQues : Emotio =     21.4 : 1.0
                    lmao = True           Emotio : whQues =     12.3 : 1.0
                      to = True           whQues : Emotio =     11.5 : 1.0
                      me = True           whQues : Emotio =     10.4 : 1.0
                      so = True           whQues : Emotio =     10.4 : 1.0
                   doing = True           whQ