In [17]:
!pwd

/data/home/arclight/notebooks/workshop/intro-to-nlp-with-pytorch/Sarcasm_Detection/algorithm


## Max Entropy Algorithm

The Max Entropy classifier is a probabilistic classifier which belongs to the class of exponential models. 

The Max Entropy does not assume that the features are conditionally independent of each other. 

The MaxEnt is based on the Principle of Maximum Entropy and from all the models that fit our training data, selects the one which has the largest entropy. The Max Entropy classifier can be used to solve a large variety of text classification problems such as language detection, topic classification, sentiment analysis and more.

![img](./dogsvsfriedchicken.png)


Let a feature function, f_i(x), take in an input, x, and return either 0 or 1, depending if the feature is present in x:

f(x) = \begin{cases} 1, & \quad \mbox{if the feature is present in } x\\ 0, & \quad \mbox{otherwise}\\ \end{cases} 

Furthermore, for N features, associate each feature function f_i(x) with a weight w_i(d), which is a number that denotes how “important” f_i(x) is compared to other features for a decision, d (In this case, spam or not spam).

We can “model” (in my opinion, this word could be understood as “estimate”) the score of a decision d on input x using the following procedure:

- For each f_i(x) in a set of N features, determine if f_i(x) should be 1 or 0
- Multiply each f_i(x) with the associated weight w_i(d), which depends on the decision d being evaluated.
- Add up all of the weight*feature pairs: sum_d = \sum_{i=1}^{N} w_i(d)*f_i(x)
- Throw the sum up into an exponent: numerator_d = \exp(sum_d) 
- Divide the sum by a number that will range the score between 0 and 1, and such that the sum of scores across all decisions is 1. It turns out that this is the sum of the numerators for every possible decision d: denominator = \sum_{d} \exp(\sum_{i=1}^{N} w_i(d)*f_i(x))
- The procedure above is pretty much the equation below:

![img](./maxentequation.png)

# Prerequisites

## Install the MegaM library

- Make sure the Punkt Sentence Tokenizer is installed
   - nltk.download('punkt')
- Install MegaM library used by NLTK for Max Entropy algorithm
   - wget http://caml.inria.fr/pub/distrib/ocaml-4.02/ocaml-4.02.1.tar.gz
   - tar -zxvf ocaml-4.02.1.tar.gz
   - ./configure
   - make world.opt
   - sudo make install
   - wget http://hal3.name/megam/megam_src.tgz
   - tar -zxvf megam_src.tgz
   - cd megam_0.92
   - Run `ocamlc -where` and note down the path
   - Edit the Makefile 74 line
     - #WITHCLIBS =-I /usr/lib/ocaml/caml
     - WITHCLIBS =-I /usr/local/lib/ocaml/caml
   - Edit the Makefile again, change the 62 line -lstr to -lcamlstr
     - #WITHSTR =str.cma -cclib -lstr
     - WITHSTR =str.cma -cclib -lcamlstr
   - Run `make`

In [14]:
import sys
import nltk
import nltk.data
from nltk.metrics.scores import (accuracy, precision, recall, f_measure,
                                          log_likelihood, approxrand)
from nltk import precision
import random
from nltk import classify
from nltk.classify import MaxentClassifier
from nltk.classify.megam import call_megam, write_megam_file, parse_megam_weights
from nltk.corpus import names
import collections,re
import csv
import json,os

In [15]:
train_data = "train_set_v2.json"
test_data = "test_set_v2.json"

nltk.data.load('nltk:tokenizers/punkt/english.pickle')
nltk.download('averaged_perceptron_tagger')
#os.environ["MEGAM"] = '/usr/local/Cellar/megam/0.9.2/bin/megam'

all_features = ["words","length","pos","interjection","question"]
metrics = {}
def feature_set_generator(text,length,label, include_list):
    features = {}
    words = text.split()

    if not include_list:
        include_list = all_features

    # Bag of words
    if("words" in include_list):
        features["words"] = tuple((word,True) for word in words)

    # Length
    if("length" in include_list):
        features["length"] = length

    # Part of speech tagging
    pos = nltk.word_tokenize(text)
    if("pos" in include_list):
        set_of_pos_tags = nltk.pos_tag(pos)
        features["pos"] = tuple(t for t in set_of_pos_tags)


    # Interjections - SUBSTANTIAL INCREASE IN ACCURACY
    if("interjection" in include_list):
        set_of_pos_tags = nltk.pos_tag(pos)
        interjection_tags = 0
        for tag in set_of_pos_tags:
            if tag == "UH":
                interjection_tags += 1
        features["interjection"] = interjection_tags

    if("question" in include_list):
        question_count = 0
        for text in words:
            if "?" in text:
                question_count += 1
        features["question"] = question_count

    return features

def me_classifier(exclude_list):
    me_classifier = 0

    with open(train_data, 'r',encoding='utf-8', errors='ignore') as csvfile:
        reader = csv.reader(csvfile)
        feature_set = [(feature_set_generator(text,length,label,exclude_list),label) for text,length,label in reader]
        #print(feature_set)
        me_classifier = MaxentClassifier.train(feature_set,"megam")

    accuracy = 0.0
    with open(test_data,'r',encoding='utf-8', errors='ignore') as testcsvfile:
        test_reader = csv.reader(testcsvfile)
        test_feature_set = [(feature_set_generator(text,length,label,exclude_list),label) for text,length,label in test_reader]
        accuracy = classify.accuracy(me_classifier, test_feature_set)

    classified = collections.defaultdict(set)
    observed = collections.defaultdict(set)
    i=1
    with open(test_data,'r',encoding='utf-8', errors='ignore') as testcsvfile:
        test_reader = csv.reader(testcsvfile)
        for text,length,label in test_reader:
            observed[label].add(i)
            classified[me_classifier.classify(feature_set_generator(text,length,label,exclude_list))].add(i)
            i+=1

    return accuracy,precision(observed["1"], classified["1"]),recall(observed['1'], classified['1']),\
           f_measure(observed['1'], classified['1']),precision(observed['0'], classified['0']),recall(observed['1'], classified['0']),f_measure(observed['1'], classified['0'])


def print_stats(a,ps,rs,fs,pns,rns,fns):
    print()
    print("****************** MAX ENTROPY STATISTICS******************************")
    print('Accuracy:', a)
    print('Sarcasm precision:', ps)
    print('Sarcasm recall:', rs)
    print('Sarcasm F-measure:', fs)
    print('Not Sarcasm precision:',pns)
    print('Not Sarcasm recall:', rns)
    print('Not Sarcasm F-measure:', fns)
    print("***********************************************************************")


def prepare_dict(dict,a,ps,rs,fs,pns,rns,fns):
    dict = {}
    dict["title"] = "Maximum Entropy with all features"
    dict["accuracy"] = a
    dict["sarcasm_precision"] = ps
    dict["sarcasm_recall"] = rs
    dict["sarcasm_f_measure"] = fs
    dict["not_sarcasm_precision"] = pns
    dict["not_sarcasm_recall"] = rns
    dict["not_sarcasm_f_measure"] = fns
    return dict



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/arclight/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
a,ps,rs,fs,pns,rns,fns = me_classifier([])
max_ent_with_all_features = {}
metrics["max_ent_with_all_features"]=prepare_dict(max_ent_with_all_features,a,ps,rs,fs,pns,rns,fns)
print_stats(a,ps,rs,fs,pns,rns,fns)


****************** MAX ENTROPY STATISTICS******************************
Accuracy: 0.6024705221785513
Sarcasm precision: 0.6297335203366059
Sarcasm recall: 0.19445647466435687
Sarcasm F-measure: 0.2971542025148908
Not Sarcasm precision: 0.5982721382289417
Not Sarcasm recall: 0.8055435253356431
Not Sarcasm F-measure: 0.5361003026372676
***********************************************************************


# Exercise 1 

## Try MaxEnt Classifier with just Parts of Speech words and inspect the metrics



# Exercise 2 

## Try MaxEnt Classifier with only interjection and inspect the metrics

# Exercise 3 

## Inspect data and note down what could improve accuracy. Do sarcastic sentences have a "?" character often or are they phrased as a question? Rhetorical questions often resemble sarcastic sentences