# T-725 Natural Language Processing: Lab 2
In today's lab, we will be working with text classification.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create your own copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

## List comprehensions in Python
List comprehensions are a concise way of creating lists in Python, and take the form:

```python
[expression for item in iterable]
```

A list comprehension creates a new list by evaluating some expression for every item in a given iterable (such as a string, a list or a dictionary). Let's look at an example:

In [142]:
sentence = "In a hole in the ground there lived a hobbit."
words = sentence.split()
print(words)

# Example of a list comprehension
word_lengths = [len(word) for word in words]
print(word_lengths)

# This is equivalent to
word_lengths = []
for word in words:
  word_lengths.append(len(word))

print(word_lengths)

['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit.']
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]


You can also add a condition to a list comprehension, so that the expression is only evaluated for items that meet a certain criterion:

In [119]:
# Keep only the words that are longer than 5 characters
long_words = [word for word in words if len(word) > 5]
print(long_words)

['ground', 'hobbit.']


Python also has set and dictionary comprehensions:

In [120]:
lowercase_characters = {c.lower() for c in sentence}
print(lowercase_characters)

word_length = {word: len(word) for word in words}
print(word_length['ground'])

{'u', 'v', 't', 'g', 'n', 'a', 'r', 'h', 'o', 'l', 'b', 'i', '.', 'e', ' ', 'd'}
6


A nested list is a list within another list. You can iterate through nested lists in the following way:

In [121]:
# A list of countries and their capitals within different continents
continents = [
    [('Iceland', 'Reykjavík'), ('Germany', 'Berlin'), ('Spain', 'Madrid')],  # Europe
    [('Japan', 'Tokyo'), ('China', 'Beijing'), ('South Korea', 'Seoul')],  # Asia
    [('Nigeria', 'Abuja'), ('Algeria', 'Algiers'), ('Angola', 'Luanda')]  # Africa
]

# Create a list of all the countries in the previous list
[country for continent in continents for (country, capital) in continent]

['Iceland',
 'Germany',
 'Spain',
 'Japan',
 'China',
 'South Korea',
 'Nigeria',
 'Algeria',
 'Angola']
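
The order of the `for` clauses in a nested comprehension mirrors the order of the equivalent nested loops: the outermost loop comes first. A minimal sketch, using a shortened, illustrative copy of the `continents` list:

```python
# Shortened copy of the list above, for illustration only
continents = [
    [('Iceland', 'Reykjavík'), ('Germany', 'Berlin')],  # Europe
    [('Japan', 'Tokyo'), ('China', 'Beijing')],  # Asia
]

# Nested comprehension: the clauses read left to right, outer loop first
countries = [country for continent in continents for (country, capital) in continent]

# Equivalent nested for loops
countries_loop = []
for continent in continents:
    for (country, capital) in continent:
        countries_loop.append(country)

print(countries == countries_loop)

# The same pattern works for dictionary comprehensions
capital_of = {country: capital for continent in continents for (country, capital) in continent}
print(capital_of['Japan'])
```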

## Sentiment analysis with NLTK
[Chapter 6](https://www.nltk.org/book/ch06.html) of the NLTK book shows how the toolkit can be used to create document classifiers, including a sentiment analyzer. The NLTK includes the `movie_reviews` corpus, which contains 2,000 movie reviews. Half of the reviews have been labelled as **positive** and the other half as **negative**. Let's download it and take a look:

In [122]:
import nltk
from nltk.corpus import movie_reviews
nltk.download('punkt')

nltk.download('movie_reviews')
print("Categories:", movie_reviews.categories())

Categories: ['neg', 'pos']


[nltk_data] Downloading package punkt to
[nltk_data]     /home/administrator/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/administrator/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


As expected, there are two categories: `pos` for positive reviews and `neg` for negative reviews. For this particular corpus, each review is stored as a separate text file. To get a list of all the text files in the corpus, we can use `movie_reviews.fileids()`. We can also get a list of files for a specific category:

In [123]:
pos_fileids = movie_reviews.fileids('pos')
neg_fileids = movie_reviews.fileids('neg')

print(pos_fileids[:5])  # The first 5 positive reviews
print(neg_fileids[:5])  # The first 5 negative reviews

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt']


We can get a list of all the tokens in the corpus with `movie_reviews.words()`. We can also specify a filename to get a single tokenized review:

In [124]:
pos_reviews = [movie_reviews.words(fid) for fid in pos_fileids]
neg_reviews = [movie_reviews.words(fid) for fid in neg_fileids]

print(pos_reviews[0][:10])  # The first 10 tokens of the first positive review
print(neg_reviews[0][:10])  # The first 10 tokens of the first negative review

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']


Some words, such as *brilliant* and *memorable*, are more strongly associated with positive reviews than negative ones. Similarly, *boring* and *unfunny* have a stronger association with negative reviews.

Using the movie review corpus, we can train a classifier to predict whether a given review is positive or negative. The classifier extracts a set of *features* from every review, which are then used to make the classification. In this case, the features we use will be a dictionary that tells us whether each of the 2,000 most common words in the corpus is present within a review or not.

In [125]:
# Create a set with 2,000 of the most frequent words in the movie review corpus
movie_fd = nltk.FreqDist(movie_reviews.words())
movie_words = {word for word, count in movie_fd.most_common(2000)}

# For a given review (in the form of a list or set of tokens), create a
# dictionary which tells us which words are present and which are not.
def get_review_features(review):
  review_words = set(review)
  return {word: word in review_words for word in movie_words}

In [126]:
# Let's see how this works for the first positive review:
example_features = get_review_features(pos_reviews[0])
print("'funny' is in the review:", example_features['funny'])
print("'boring' is in the review:", example_features['boring'])

'funny' is in the review: True
'boring' is in the review: False


Next, let's create a training set that we can use to train a Naive Bayes classifier. The training set, in this case, is a list of tuples in the format `[(features, category), ...]`, where `features` is a dictionary from `get_review_features()` and `category` is either `pos` or `neg`, depending on whether the review is positive or negative. To get an idea of how well the classifier performs, we're going to reserve 10% of the reviews for testing. That means that we'll be training our classifier on 1,800 examples and testing it on 200 examples.

In [127]:
pos_examples = [(get_review_features(review), 'pos') for review in pos_reviews]
neg_examples = [(get_review_features(review), 'neg') for review in neg_reviews]

movie_training = pos_examples[:900] + neg_examples[:900]  # 1800 examples total
movie_test = pos_examples[900:] + neg_examples[900:]  # 200 examples total

Now we have everything we need to train our classifier.

In [128]:
movie_classifier = nltk.NaiveBayesClassifier.train(movie_training)

How well does it perform on the test set?

In [129]:
print("Accuracy:", nltk.classify.accuracy(movie_classifier, movie_test))

Accuracy: 0.815
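
`nltk.classify.accuracy` reports the fraction of test examples whose predicted label matches the gold label. The sketch below reproduces that calculation by hand; `toy_classify` and `toy_test` are hypothetical stand-ins for a trained classifier and a test set, not part of NLTK.

```python
def toy_classify(features):
    # Hypothetical rule standing in for a trained classifier
    return 'neg' if features.get('boring') else 'pos'

# Hypothetical test set in the same [(features, category), ...] format
toy_test = [
    ({'funny': True, 'boring': False}, 'pos'),
    ({'funny': False, 'boring': True}, 'neg'),
    ({'funny': True, 'boring': True}, 'pos'),  # misclassified by the rule
    ({'funny': False, 'boring': False}, 'pos'),
]

def accuracy(classify, labelled_examples):
    # Fraction of examples where the prediction equals the gold label
    correct = sum(classify(features) == label for features, label in labelled_examples)
    return correct / len(labelled_examples)

print("Accuracy:", accuracy(toy_classify, toy_test))
```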


The classifier achieves an accuracy of 81.5%. Let's take a look at which features the classifier found most informative:

In [130]:
movie_classifier.show_most_informative_features(20)

Most Informative Features
             outstanding = True              pos : neg    =     15.6 : 1.0
                   mulan = True              pos : neg    =      9.0 : 1.0
             wonderfully = True              pos : neg    =      7.1 : 1.0
                  seagal = True              neg : pos    =      7.0 : 1.0
                   damon = True              pos : neg    =      6.1 : 1.0
                   flynt = True              pos : neg    =      5.7 : 1.0
                  wasted = True              neg : pos    =      5.6 : 1.0
                    lame = True              neg : pos    =      5.3 : 1.0
                  poorly = True              neg : pos    =      5.2 : 1.0
                   awful = True              neg : pos    =      4.9 : 1.0
              ridiculous = True              neg : pos    =      4.8 : 1.0
                    jedi = True              pos : neg    =      4.4 : 1.0
                 unfunny = True              neg : pos    =      4.4 : 1.0
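
Each row above is a likelihood ratio: how much more likely a feature value is under one label than under the other. The sketch below computes such a ratio from made-up counts (not taken from the corpus). It assumes add-0.5 smoothing over the two feature values `True` and `False`, which we believe matches NLTK's default expected-likelihood estimator; treat that detail as an assumption.

```python
def smoothed_probability(count, total, bins=2, gamma=0.5):
    # Add-gamma smoothing: (count + gamma) / (total + gamma * bins)
    return (count + gamma) / (total + gamma * bins)

# Made-up counts: reviews (out of 900 per label) that contain some word
reviews_per_label = 900
pos_reviews_with_word = 45
neg_reviews_with_word = 3

p_word_given_pos = smoothed_probability(pos_reviews_with_word, reviews_per_label)
p_word_given_neg = smoothed_probability(neg_reviews_with_word, reviews_per_label)

# How many times more likely the word is in a positive review
ratio = p_word_given_pos / p_word_given_neg
print("pos : neg = {:.1f} : 1.0".format(ratio))
```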

# Assignment
Answer the following questions and hand in your solution in Canvas before 8:30 AM Monday, September 11th. Remember to save your file before uploading it.

## Question 1
The NLTK also includes a `subjectivity` corpus, which contains a collection of sentences that have been categorized as either **subjective** (emotional, expressing personal feelings and views) or **objective** (more rational, factual). Some examples:

* **Objective sentences**:
  * uma thurman stars in quentin tarantino's fourth film venture , kill bill .  
  * he lives in a motor garage with his six friends .
  * the ensuing battle was one of the most savage in u . s . history .
* **Subjective sentences**:
  * seagal's strenuous attempt at a change in expression could very well clinch him this year's razzie .
  * de niro cries . you'll cry for your money back .
  * a heroic tale of persistence that is sure to win viewers' hearts .

Unlike the movie review corpus, where every review is stored in a separate file, here there is only one file for each category.

Complete the following tasks:
1. Import and download the `subjectivity` corpus.
2. Find the names of each category.
3. Using the category names, get the relative path of each file.
4. Get a list of tokenized sentences for each category (using `subjectivity.sents(fileid)`).

In [131]:
import nltk
from nltk.corpus import subjectivity
nltk.download('subjectivity')

category_names = subjectivity.categories()
print("There are {} categories: {}".format(len(category_names), category_names))
# There are 2 categories: ['obj', 'subj']

category_files = [subjectivity.fileids(category) for category in category_names]
print("There are {} files in the 'obj' category and {} files in the 'subj' category".format(len(category_files[0]), len(category_files[1])))
# There are 1 files in the 'obj' category and 1 files in the 'subj' category

category_sentences = [subjectivity.sents(fileid) for fileid in category_files]
print("There are {} sentences in the 'obj' category and {} sentences in the 'subj' category".format(len(category_sentences[0]), len(category_sentences[1])))
# There are 5000 sentences in the 'obj' category and 5000 sentences in the 'subj' category
print("Extract of the first 5 sentences in the 'obj' category: {}".format(category_sentences[0][:5]))
'''
Extract of the first 5 sentences in the 'obj' category: [['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ['spurning', 'her', "mother's", 'insistence', 'that', 'she', 'get', 'on', 'with', 'her', 'life', ',', 'mary', 'is', 'thrown', 'out', 'of', 'the', 'house', ',', 'rejected', 'by', 'joe', ',', 'and', 'expelled', 'from', 'school', 'as', 'she', 'grows', 'larger', 'with', 'child', '.'], ['amitabh', "can't", 'believe', 'the', 'board', 'of', 'directors', 'and', 'his', 'mind', 'is', 'filled', 'with', 'revenge', 'and', 'what', 'better', 'revenge', 'than', 'robbing', 'the', 'bank', 'himself', ',', 'ironic', 'as', 'it', 'may', 'sound', '.'], ['she', ',', 'among', 'others', 'excentricities', ',', 'talks', 'to', 'a', 'small', 'rock', ',', 'gertrude', ',', 'like', 'if', 'she', 'was', 'alive', '.']]
'''
print("Extract of the first 5 sentences in the 'subj' category: {}".format(category_sentences[1][:5]))
'''
Extract of the first 5 sentences in the 'subj' category: [['smart', 'and', 'alert', ',', 'thirteen', 'conversations', 'about', 'one', 'thing', 'is', 'a', 'small', 'gem', '.'], ['color', ',', 'musical', 'bounce', 'and', 'warm', 'seas', 'lapping', 'on', 'island', 'shores', '.', 'and', 'just', 'enough', 'science', 'to', 'send', 'you', 'home', 'thinking', '.'], ['it', 'is', 'not', 'a', 'mass-market', 'entertainment', 'but', 'an', 'uncompromising', 'attempt', 'by', 'one', 'artist', 'to', 'think', 'about', 'another', '.'], ['a', 'light-hearted', 'french', 'film', 'about', 'the', 'spiritual', 'quest', 'of', 'a', 'fashion', 'model', 'seeking', 'peace', 'of', 'mind', 'while', 'in', 'a', 'love', 'affair', 'with', 'a', 'veterinarian', 'who', 'is', 'a', 'non-practicing', 'jew', '.'], ['my', 'wife', 'is', 'an', 'actress', 'has', 'its', 'moments', 'in', 'looking', 'at', 'the', 'comic', 'effects', 'of', 'jealousy', '.', 'in', 'the', 'end', ',', 'though', ',', 'it', 'is', 'only', 'mildly', 'amusing', 'when', 'it', 'could', 'have', 'been', 'so', 'much', 'more', '.']]
'''

There are 2 categories: ['obj', 'subj']
There are 1 files in the 'obj' category and 1 files in the 'subj' category


[nltk_data] Downloading package subjectivity to
[nltk_data]     /home/administrator/nltk_data...
[nltk_data]   Package subjectivity is already up-to-date!


There are 5000 sentences in the 'obj' category and 5000 sentences in the 'subj' category
Extract of the first 5 sentences in the 'obj' category: [['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ['spurning', 'her', "mother's", 'insistence', 'that', 'she', 'get', 'on', 'with', 'her', 'life', ',', 'mary', 'is', 'thrown', 'out', 'of', 'the', 'house', ',', 'rejected', 'by', 'joe', ',', 'and', 'expelled', 'from', 'school', 'as', 'she', 'grows', 'larger', 'with', 'child', '.'], ['amit

## Question 2
Complete the following tasks:
1. Create a set with the 2,000 most common words in the `subjectivity` corpus using `nltk.FreqDist()`.
2. Create a function that takes a single, tokenized sentence as input (e.g., `['the', 'ensuing', 'battle', ...]`), and returns a dictionary of the 2,000 most frequent words and whether or not they are in the sentence (e.g., `{'battle': True, 'amusing': False, ...}`).

In [132]:
subjectivity_frequency_distribution = nltk.FreqDist(subjectivity.words())
print("There are {} unique words in the 'subjectivity' corpus".format(len(subjectivity_frequency_distribution)))
# There are 23906 unique words in the 'subjectivity' corpus

top_2000_words = {word for word, count in subjectivity_frequency_distribution.most_common(2000)}
print("The 2000 most common words are: {}".format(top_2000_words))
# Note: output omitted from the code, but the set contains the 2000 most common words

def to_word_occurrence_dictionary(tokenized_sentence, top_words=top_2000_words) -> dict[str, int]:
    # Count how often each of the top words occurs in the sentence
    token_counts = nltk.FreqDist(tokenized_sentence)
    return {top_word: token_counts[top_word] for top_word in top_words}

def to_word_presence_dictionary(tokenized_sentence, top_words=top_2000_words) -> dict[str, bool]:
    # A word is present if it occurs at least once in the sentence
    word_occurrence_dictionary = to_word_occurrence_dictionary(tokenized_sentence, top_words)
    return {word: count > 0 for word, count in word_occurrence_dictionary.items()}

There are 23906 unique words in the 'subjectivity' corpus
The 2000 most common words are: {'escapes', 'stops', 'hollywood', 'ill', 'the', 'peace', 'friends', 'adult', 'possibly', 'single', 'intellectual', 'seemingly', 'somewhat', 'perhaps', 'kung', 'dealer', 'too', 'k', 'strong', 'goofy', 'victims', 'stone', 'ready', 'others', 'revolution', 'faced', 'ship', 'completely', 'turned', 'following', 'play', 'far', 'date', 'spend', 'running', 'lovers', 'about', 'sons', 'sean', 'pieces', 'appears', 'heart', 'finish', 'home', 'cell', 'seeking', 'band', 'emotionally', 'adults', 'seeks', 'beneath', 'then', 'account', 'expect', 'police', 'inner', 'suddenly', 'somewhere', 'beyond', 'force', 'entire', 'india', 'abandoned', 'enough', 'whose', 'leading', 'land', 'someone', 'turns', 'opportunity', 'touch', 'constructed', 'sets', 'y', 'returns', 'learn', 'jane', 'ultimate', 'rural', 'runs', 'dumb', 'ways', 'yourself', 'ugly', 'students', 'spectacle', 'points', 'otherwise', 'involved', '5', 'tedious', 's
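
To see the shape of the result, here is a minimal, self-contained sketch of the same presence-feature idea on a toy vocabulary. `toy_top_words` and `presence_features` are hypothetical names used only for this illustration; the set lookup is an alternative to counting occurrences, not the solution above.

```python
toy_top_words = {'battle', 'amusing', 'the'}

def presence_features(tokenized_sentence, top_words):
    # Convert to a set once so that each membership test is fast
    sentence_words = set(tokenized_sentence)
    return {word: word in sentence_words for word in top_words}

features = presence_features(['the', 'ensuing', 'battle', 'was', 'savage'], toy_top_words)
print(sorted(features.items()))
```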

## Question 3
Complete the following tasks:
1. Create a training set with 9,000 sentences (4,500 of each category)
2. Create a test set with 1,000 sentences (500 of each category)

In [133]:
training_objective_set = [to_word_presence_dictionary(sentence) for sentence in category_sentences[0][:4500]]
training_subjective_set = [to_word_presence_dictionary(sentence) for sentence in category_sentences[1][:4500]]
training_set = [(sentence, 'obj') for sentence in training_objective_set] + [(sentence, 'subj') for sentence in training_subjective_set]

testing_objective_set = [to_word_presence_dictionary(sentence) for sentence in category_sentences[0][4500:]]
testing_subjective_set = [to_word_presence_dictionary(sentence) for sentence in category_sentences[1][4500:]]
testing_set = [(sentence, 'obj') for sentence in testing_objective_set] + [(sentence, 'subj') for sentence in testing_subjective_set]
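
A quick way to confirm that a split has the intended size and class balance is to count the labels. The sketch below does this on placeholder data; `split_by_class` is a hypothetical helper, and the empty feature dictionaries stand in for real ones.

```python
from collections import Counter

def split_by_class(examples_by_label, n_train):
    # Take the first n_train examples of each label for training, the rest for testing
    train, test = [], []
    for label, examples in examples_by_label.items():
        train += [(features, label) for features in examples[:n_train]]
        test += [(features, label) for features in examples[n_train:]]
    return train, test

# Placeholder data: 5,000 empty feature dictionaries per label
placeholder = {'obj': [{} for _ in range(5000)], 'subj': [{} for _ in range(5000)]}
train, test = split_by_class(placeholder, n_train=4500)

print(Counter(label for _, label in train))
print(Counter(label for _, label in test))
```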

## Question 4
Complete the following tasks:
1. Train a Naive Bayes classifier using the training set from the previous question.
2. Evaluate the classifier on the test set. How accurate is it?
3. Find the 20 most informative features.

In [134]:
subjectivity_classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Train Accuracy:", nltk.classify.accuracy(subjectivity_classifier, training_set))
# Train Accuracy: 0.9198888888888889
print("Accuracy:", nltk.classify.accuracy(subjectivity_classifier, testing_set))
# Accuracy: 0.906

top_20_most_informative_features = subjectivity_classifier.most_informative_features(20)
print("The 20 most informative features are: {}".format(top_20_most_informative_features))


Train Accuracy: 0.9198888888888889
Accuracy: 0.906
The 20 most informative features are: [('--', True), ('order', True), ('decides', True), ('sister', True), ('entertaining', True), ('girlfriend', True), ('discover', True), ("film's", True), ("you're", True), ('daughter', True), ('married', True), ('amusing', True), ('plans', True), ('probably', True), ('plan', True), ('town', True), ("you've", True), ('kill', True), ('slow', True), ('interesting', True)]


## Question 5
A dialog act is the type of *action* that a speaker performs with an utterance. In the NPS Chat corpus of instant messaging posts, each utterance is labeled with one of 15 dialog act types, such as **Statement**, **Emotion**, **ynQuestion**, **Continuer**, etc.

Your task is to classify text from the NPS corpus into two dialog acts: **whQuestion** or **Emotion**.

Start by downloading the NPS corpus and getting all posts from the corpus:

In [135]:
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()

[nltk_data] Downloading package nps_chat to
[nltk_data]     /home/administrator/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!


Create a list that only includes posts of class **Emotion** and **whQuestion**. You can access the class of a post by calling `post.get("class")`.

In [136]:
emotion_posts = [post for post in posts if post.get("class") == "Emotion"]
wh_question_posts = [post for post in posts if post.get("class") == "whQuestion"]
print("There are {} posts of class 'Emotion' and {} posts of class 'whQuestion'".format(len(emotion_posts), len(wh_question_posts)))
# There are 1106 posts of class 'Emotion' and 533 posts of class 'whQuestion'
# NOTE: the classes are imbalanced, which may bias the classifier toward the larger class

classified_posts = {post: post.get("class") for post in emotion_posts + wh_question_posts}
classified_posts = {key: value for key, value in sorted(classified_posts.items(), key=lambda x: x[1])}
print("The total amount of classified posts is {}".format(len(classified_posts)))


There are 1106 posts of class 'Emotion' and 533 posts of class 'whQuestion'
The total amount of classified posts is 1639
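
Because the two classes are imbalanced, it helps to know how well a trivial classifier would do. A minimal sketch, using the class counts printed above: always predicting the majority class gives a baseline to compare later accuracies against. The baseline is computed over all 1,639 posts, so the figure for a particular test split may differ slightly.

```python
class_counts = {'Emotion': 1106, 'whQuestion': 533}  # counts printed above

# The majority class is the label with the highest count
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / sum(class_counts.values())

print("Majority class:", majority_class)
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
```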


Randomize the posts and create a training set and a test set, where the first 1300 **Emotion + whQuestion** posts are used for training and the rest for testing.

In [137]:
import random

random.seed(1234)
# NOTE: the seed is set only to make the results reproducible

randomized_posts = list(classified_posts.items())
random.shuffle(randomized_posts)

training_set = randomized_posts[:1300]
testing_set = randomized_posts[1300:]
print("There are {} posts in the training set and {} posts in the test set".format(len(training_set), len(testing_set)))
# There are 1300 posts in the training set and 339 posts in the test set

There are 1300 posts in the training set and 339 posts in the test set
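
`random.seed` sets the state of the module-wide generator, so any other code that draws random numbers can affect the result. An alternative sketch uses a dedicated `random.Random` instance; `shuffled_split` is a hypothetical helper, and the integers below are placeholders for the 1,639 posts.

```python
import random

def shuffled_split(items, n_train, seed):
    # A private generator keeps the shuffle independent of the global random state
    rng = random.Random(seed)
    shuffled = list(items)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

placeholder_posts = list(range(1639))
train_a, test_a = shuffled_split(placeholder_posts, n_train=1300, seed=1234)
train_b, test_b = shuffled_split(placeholder_posts, n_train=1300, seed=1234)

print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```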


Create a list of the 200 most frequent tokens in the training set. You can access the text of a `post` object by calling `post.text`. Remember that the **split** function will use whitespace to tokenize a string: `some_string.split()`

In [138]:
all_tokens_in_train_set = []
for post, _ in training_set:
    all_tokens_in_train_set += nltk.word_tokenize(post.text)
print("There are {} tokens in the training set".format(len(all_tokens_in_train_set)))
# There are 4944 tokens in the training set
print("The first 5 tokens in the training set are: {}".format(all_tokens_in_train_set[:5]))
# The first 5 tokens in the training set are: ['what', 'happened', 'last', 'night', '?']

frequency_distribution = nltk.FreqDist(all_tokens_in_train_set)
top_200_tokens_posts = [token for token, _ in frequency_distribution.most_common(200)]
print("The 200 most frequent tokens are: {}".format(top_200_tokens_posts))

There are 4944 tokens in the training set
The first 5 tokens in the training set are: ['what', 'happened', 'last', 'night', '?']
The 200 most frequent tokens are: [')', '?', '(', 'lol', '!', 'what', 'you', 'how', 'lmao', ':', 'who', 'are', 'the', 'is', 'to', ',', 'LOL', 'haha', 'u', 'why', 'where', 'and', 'whats', '.', '...', '-', ']', 'do', 'in', ';', 'up', 'did', "'s", 'a', 'so', '..', '[', 'it', 'here', 'omg', 'that', 'from', '....', 'me', 'about', 'for', 'was', 'What', 'I', 'damn', 'LMAO', 'of', 'everyone', '<', 'LoL', '11-09-40sUser18', 'r', 'we', '@', 'your', 'there', 'wants', 'i', 'on', 'ya', 'have', 'good', 'ha', '10-19-adultsUser23', 'all', '10-19-adultsUser35', 'asl', 'hey', 'been', 'like', '11-08-40sUser48', '11-08-adultsUser65', 'ok', 'when', 'hahah', '*', 'How', 'doing', '10-19-40sUser9', '10-24-40sUser16', 'Lol', 'today', 'oh', 'talk', '11-06-adultsUser105', 'they', '11-09-40sUser7', '11-09-40sUser48', '10-19-30sUser31', '11-09-40sUser30', '10-19-20sUser121', 'not', 'name
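
The task text suggests `str.split()`, while the solution above uses `nltk.word_tokenize`, which is why punctuation marks such as `?` and `)` appear as separate tokens. For comparison, a minimal sketch of the whitespace-only variant on made-up posts, with `collections.Counter` standing in for `nltk.FreqDist`:

```python
from collections import Counter

# Made-up posts, for illustration only
toy_posts = ["what happened last night ?", "lol", "what's up?", "lol ok"]

tokens = []
for text in toy_posts:
    tokens += text.split()  # splits on whitespace only

print(tokens)

# most_common(n) returns (token, count) pairs, most frequent first
top_tokens = [token for token, _ in Counter(tokens).most_common(2)]
print(top_tokens)
```

With whitespace splitting, punctuation that is attached to a word stays attached (`up?` is a single token), so the resulting vocabulary differs from the tokenized one.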



Define two feature selection functions that take a string as input and output a dictionary of features:
* `get_word_features(string)`
* `get_custom_features(string)`

Begin by defining `get_word_features`. This function should use the words as features, just like in the movie review example above.




In [139]:
def get_word_features(string):
    tokens_from_string = nltk.word_tokenize(string)
    return to_word_presence_dictionary(tokens_from_string, top_200_tokens_posts)

Next, define `get_custom_features`. This function should extract features from the text that characterize the **Emotion** and **whQuestion** classes.

In [140]:
def get_custom_features(string):
    custom_features = dict()
    custom_features["contains_?"] = "?" in string
    custom_features["contains_!"] = "!" in string
    custom_features["post_length"] = len(string) > 35
    # NOTE: Threshold set manually, could be optimized
    custom_features["contains_emoticon"] = any(token in string for token in [":)", ":(", ":D", ":P", ":/", ":|", ":O", ":S", ":*", ":'("])
    custom_features["contains_laugh"] = any(token in string for token in ["haha", "hahaha", "lol", "lmao", "rofl"])
    custom_features["contains_wh"] = any(token in string for token in ["what", "when", "where", "who", "why", "how"])
    custom_features["contains_emotion"] = any(token in string for token in ["love", "hate", "like", "dislike", "sad", "happy", "angry", "mad", "annoyed", "excited", "bored", "scared", "fear", "afraid", "surprised", "surprise", "disgusted", "disgust", "shocked", "shock", "confused", "confuse", "confusing", "confusion", "depressed", "depress", "depressing", "depression", "anxious", "anxiety", "anxiously"])
    return custom_features
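
Note that `token in string` is a substring test, so `"how" in "show me"` is `True`, and that the matching is case-sensitive. If that turns out to matter, one possible variant (an assumption, not part of the solution above) lowercases the text and compares whole whitespace-separated tokens:

```python
WH_WORDS = {"what", "when", "where", "who", "why", "how"}

def contains_wh_substring(string):
    # Substring test, as in the function above
    return any(token in string for token in WH_WORDS)

def contains_wh_token(string):
    # Whole-token test on lowercased, whitespace-separated tokens
    return any(token in WH_WORDS for token in string.lower().split())

for text in ["show me", "How are you", "who is there"]:
    print(text, contains_wh_substring(text), contains_wh_token(text))
```

Punctuation attached to a word (as in `what?`) would still need handling in the token-based variant, for example by tokenizing with `nltk.word_tokenize` first.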

Complete the following tasks:
*   Train two Naive Bayes classifiers on the **Emotion + whQuestion** training set: one that uses the `get_word_features` function and another that uses `get_custom_features`.
*   Evaluate each classifier on the test set. How accurate are they? Which one is better?
*   What are the 20 most informative features for each classifier?


In [141]:
bayes_classifier_training_set_word_features = [(get_word_features(post.text), post_class) for post, post_class in training_set]
bayes_classifier_testing_set_word_features = [(get_word_features(post.text), post_class) for post, post_class in testing_set]

word_features_naive_bayes_classifier = nltk.NaiveBayesClassifier.train(bayes_classifier_training_set_word_features)
print("Train Accuracy:", nltk.classify.accuracy(word_features_naive_bayes_classifier, bayes_classifier_training_set_word_features))
print("Accuracy:", nltk.classify.accuracy(word_features_naive_bayes_classifier, bayes_classifier_testing_set_word_features))
print("The 20 most informative features are: {}".format(word_features_naive_bayes_classifier.most_informative_features(20)))


bayes_classifier_training_set_custom_features = [(get_custom_features(post.text), post_class) for post, post_class in training_set]
bayes_classifier_testing_set_custom_features = [(get_custom_features(post.text), post_class) for post, post_class in testing_set]

custom_features_naive_bayes_classifier = nltk.NaiveBayesClassifier.train(bayes_classifier_training_set_custom_features)
print("Train Accuracy:", nltk.classify.accuracy(custom_features_naive_bayes_classifier, bayes_classifier_training_set_custom_features))
print("Accuracy:", nltk.classify.accuracy(custom_features_naive_bayes_classifier, bayes_classifier_testing_set_custom_features))
print("The 20 most informative features are: {}".format(custom_features_naive_bayes_classifier.most_informative_features(20)))
# NOTE: Lower train accuracy but higher test accuracy for custom features

Train Accuracy: 0.9830769230769231
Accuracy: 0.9852507374631269
The 20 most informative features are: [('how', True), ('u', True), ('you', True), ('the', True), ('and', True), ('up', True), ("'s", True), ('so', True), ('lmao', True), ('from', True), ('me', True), ('to', True), ('lol', True), ('everyone', True), ('it', True), (')', True), ('(', True), ('hey', True), ('we', True), ('ok', True)]
Train Accuracy: 0.9746153846153847
Accuracy: 0.9882005899705014
The 20 most informative features are: [('contains_wh', True), ('contains_laugh', True), ('contains_emoticon', True), ('contains_emotion', True), ('post_length', True), ('contains_wh', False), ('contains_?', False), ('contains_laugh', False), ('contains_!', True), ('post_length', False), ('contains_emoticon', False), ('contains_!', False), ('contains_emotion', False), ('contains_?', True)]
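
With only 339 test posts, small accuracy differences correspond to very few posts. A quick sketch converting the two test accuracies printed above back into counts of correctly classified posts:

```python
test_size = 339  # number of posts in the test set

accuracies = {
    'word features': 0.9852507374631269,
    'custom features': 0.9882005899705014,
}

# accuracy = correct / test_size, so correct = accuracy * test_size
correct = {name: round(acc * test_size) for name, acc in accuracies.items()}
print(correct)
print("Difference in posts:", correct['custom features'] - correct['word features'])
```

The two classifiers differ by a single test post, so this split alone does not show that one feature set is better than the other.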
