# POS-tagging for comparative/superlative identification

__Contents__

0. [Start the Stanford CoreNLP server](#Start-the-Stanford-CoreNLP-server)
0. [Convenience function for POS tagging](#Convenience-function-for-POS-tagging)
0. [Comparative/Superlative identifiers](#Comparative/Superlative-identifiers)
0. [Data analysis](#Data-analysis)
  0. [Tag the data](#Tag-the-data)
  0. [Identify comparatives and superlatives](#Identify-comparatives-and-superlatives)
  0. [Inspection](#Inspection)
  0. [Write to file](#Write-to-file)

In [18]:
import json
import os
import pandas as pd
import nltk
from pycorenlp import StanfordCoreNLP

## Start the Stanford CoreNLP server

Before running this notebook, [get CoreNLP](http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip), go into its directory, and run

`java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000`

If you're using port 9000 for something else, change that value and then change `PORT` in the next cell.
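
Before launching the server, it can help to confirm that the chosen port is actually free. The following is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper for illustration and is not part of CoreNLP or `pycorenlp`. Note that it will also report the port as busy once the CoreNLP server itself is running.

```python
import socket

def port_is_free(port, host='127.0.0.1'):
    """Return True if nothing accepts TCP connections on (host, port).

    Hypothetical helper: it only attempts a connection and checks the result.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(1.0)
    try:
        # connect_ex returns 0 when the connection succeeds, i.e. when
        # some process is already listening on the port.
        return sock.connect_ex((host, port)) != 0
    finally:
        sock.close()

PORT = 9000
if port_is_free(PORT):
    print('Port {} looks free.'.format(PORT))
else:
    print('Port {} is in use (perhaps by a server started earlier).'.format(PORT))
```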

In [10]:
PORT = 9000

nlp = StanfordCoreNLP('http://localhost:{}'.format(PORT))

## Convenience function for POS tagging

In [47]:
def stanford_pos(text):
    """
    Parameters
    ----------
    text : str
       Raw text. CoreNLP handles all tokenizing, at the sentence and
       word level.

    Returns
    -------
    list of tuples (str, str)
       The first member of each pair is the word, the second its POS tag.
    """
    # Report inputs that are not strings; the annotate call below is
    # still attempted so that the failure is visible.
    if not isinstance(text, basestring):
        print '%s: %s' % (type(text), str(text))
    try:
        ann = nlp.annotate(
            text,
            properties={'annotators': 'pos',
                        'outputFormat': 'json'})
    except Exception:
        # Show the offending text before re-raising.
        print text
        raise
    lemmas = []
    # If annotate hands back a string rather than parsed JSON, replace
    # NUL characters and parse it here.
    if isinstance(ann, basestring):
        ann = json.loads(ann.replace('\x00', '?').encode('latin-1'), encoding='utf-8', strict=True)
    for sentence in ann['sentences']:
        for token in sentence['tokens']:
            lemmas.append((token['word'], token['pos']))
    return lemmas
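
To make the shape of the data concrete without a running server, here is a small sketch of the flattening step on a hand-built dictionary. The `sample` value is a hypothetical stand-in for a server response, trimmed to the keys that `stanford_pos` reads (`sentences`, `tokens`, `word`, `pos`); `pairs_from_annotation` is an illustrative name, not part of the notebook.

```python
def pairs_from_annotation(ann):
    """Flatten a CoreNLP-style JSON annotation into (word, pos) pairs."""
    return [(token['word'], token['pos'])
            for sentence in ann['sentences']
            for token in sentence['tokens']]

# Hypothetical response, reduced to the fields used above.
sample = {
    'sentences': [
        {'tokens': [{'word': 'The', 'pos': 'DT'},
                    {'word': 'darker', 'pos': 'JJR'},
                    {'word': 'blue', 'pos': 'JJ'},
                    {'word': 'one', 'pos': 'NN'}]},
    ]
}

for word, pos in pairs_from_annotation(sample):
    print('{}\t{}'.format(word, pos))
```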

## Comparative/Superlative identifiers

In [12]:
from nltk.stem.wordnet import WordNetLemmatizer

LEMMATIZER = WordNetLemmatizer()

def is_comp_sup(word, pos, tags, check_lemmatizer=False):
    """
    Parameters
    ----------
    word, pos : str, str
        A token and its POS tag (together, a 'lemma' in this
        notebook's terminology).

    tags : iterable of str
        The tags considered positive evidence for comp/sup morphology.

    check_lemmatizer : bool
        If True, then if the `pos` is in `tags`, we also check that
        `word` is different from the lemmatized version of word
        according to WordNet, treating it as an adjective. This 
        could be used to achieve greater precision, perhaps at the
        expense of recall.
       
    Returns
    -------
    bool       
    """
    if pos not in tags:
        return False
    if check_lemmatizer and LEMMATIZER.lemmatize(word, 'a') == word:
        return False
    return True

def is_superlative(word, pos, check_lemmatizer=False):
    return is_comp_sup(
        word, pos, {'JJS', 'RBS'}, check_lemmatizer=check_lemmatizer)

def is_comparative(word, pos, check_lemmatizer=False):
    return is_comp_sup(
        word, pos, {'JJR', 'RBR'}, check_lemmatizer=check_lemmatizer)
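
The tag sets above are Penn Treebank tags: `JJR` and `RBR` mark comparative adjectives and adverbs, `JJS` and `RBS` the superlative ones. The sketch below shows the tag-only decision (the `check_lemmatizer=False` path) on a few (word, tag) pairs; `classify` is a hypothetical helper added for illustration, and it deliberately leaves out the WordNet check.

```python
COMPARATIVE_TAGS = {'JJR', 'RBR'}
SUPERLATIVE_TAGS = {'JJS', 'RBS'}

def classify(pos):
    """Label a POS tag as 'comparative', 'superlative', or 'neither'."""
    if pos in COMPARATIVE_TAGS:
        return 'comparative'
    if pos in SUPERLATIVE_TAGS:
        return 'superlative'
    return 'neither'

examples = [('darker', 'JJR'), ('more', 'RBR'), ('most', 'RBS'), ('blue', 'JJ')]
for word, pos in examples:
    print('{:8s} {:4s} {}'.format(word, pos, classify(pos)))
```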

## Data analysis

In [5]:
d_human = (pd.read_csv('humanOutput/colorReferenceMessage2.csv')
     .assign(source = 'human'))
d_model = (pd.read_csv('modelOutput/speaker_big_s0_untuned_message.csv')
     .assign(source = 'model'))
d = d_human.append(d_model)
d

Unnamed: 0,gameid,time,roundNum,sender,contents,source
0,1124-1,1459877203862,1,speaker,The darker blue one,human
1,1124-1,1459877214034,2,speaker,purple,human
2,1124-1,1459877223719,3,speaker,Medium pink,human
3,1124-1,1459877227433,3,speaker,the medium dark one,human
4,1124-1,1459877240480,4,speaker,lime,human
5,1124-1,1459877257997,5,speaker,Mint green.,human
6,1124-1,1459877267242,6,speaker,Mud brown,human
7,1124-1,1459877278380,7,speaker,Mud brown,human
8,1124-1,1459877294720,8,speaker,Camo green,human
9,1124-1,1459877305438,9,speaker,Darkish red,human


### Tag the data

In [48]:
stanford_pos('\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x81\xe4\xbd\xa0\xe5\xa5\xbd\xe5\x90\x97'.decode('utf-8'))

[(u'\u4f60\u597d', u'NN'), (u'\uff01', u'CD'), (u'\u4f60\u597d\u5417', u'CD')]

In [50]:
# A lemma is a (word, pos) tag pair.
d['lemmas'] = [stanford_pos(text.decode('utf-8')) for text in d['contents']]

### Identify comparatives and superlatives

For every message, each step below builds a list with a 1 at the position of each comparative (or superlative) token and a 0 everywhere else, so the list stays aligned with the tagged tokens of the original text.

In [15]:
d['superlatives'] = [[1 if is_superlative(*lem) else 0 for lem in lemmas]
                     for lemmas in d['lemmas']]

In [52]:
d['comparatives'] = [[1 if is_comparative(*lem) else 0 for lem in lemmas]
                     for lemmas in d['lemmas']]

Count the superlatives and comparatives in each message:

In [53]:
d['numSuper'] = [sum(counts) for counts in d['superlatives']]

d['numComp'] = [sum(counts) for counts in d['comparatives']]
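
As a worked example of the indicator lists and counts, the sketch below applies the same logic to a single tagged message without pandas or the server. The tagged tokens are those shown for "The darker blue one" in the inspection output; `indicators` is a hypothetical helper name.

```python
def indicators(tagged, tags):
    """Return a 0/1 list aligned with `tagged`: 1 where the POS tag is in `tags`."""
    return [1 if pos in tags else 0 for _, pos in tagged]

tagged = [('The', 'DT'), ('darker', 'JJR'), ('blue', 'JJ'), ('one', 'NN')]

comparatives = indicators(tagged, {'JJR', 'RBR'})
superlatives = indicators(tagged, {'JJS', 'RBS'})

# Summing an indicator list gives the per-message count.
print('comparatives: {} (count {})'.format(comparatives, sum(comparatives)))
print('superlatives: {} (count {})'.format(superlatives, sum(superlatives)))
```

The indicator list always has one entry per token, which is what keeps it aligned with the tagged message.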

### Inspection

Run the cell below to allow for non-scrolling display:

In [54]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

<IPython.core.display.Javascript object>

In [55]:
d.query('numComp > 0').head()

Unnamed: 0,gameid,time,roundNum,sender,contents,source,lemmas,superlatives,comparatives,numSuper,numComp
0,1124-1,1459877203862,1,speaker,The darker blue one,human,"[(The, DT), (darker, JJR), (blue, JJ), (one, NN)]","[0, 0, 0, 0]","[0, 1, 0, 0]",0,1
13,1124-1,1459877360202,13,speaker,"One of the brown ones, the lighter shaded one",human,"[(One, CD), (of, IN), (the, DT), (brown, JJ), ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]",0,1
14,1124-1,1459877388314,14,speaker,The more vibrantly red one.~~~~~~ not the more...,human,"[(The, DT), (more, JJR), (vibrantly, RB), (red...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",0,2
31,1124-1,1459877544164,26,speaker,darker red,human,"[(darker, JJR), (red, NN)]","[0, 0]","[1, 0]",0,1
33,1124-1,1459877564218,28,speaker,"purple, darker one",human,"[(purple, JJ), (,, ,), (darker, JJR), (one, CD)]","[0, 0, 0, 0]","[0, 0, 1, 0]",0,1


In [56]:
d.query('numComp > 0 & source == "model"').head()

Unnamed: 0,gameid,time,roundNum,sender,contents,source,lemmas,superlatives,comparatives,numSuper,numComp
148,1369-5,1476491571250,34,speaker,darker green ~ darker green,model,"[(darker, JJR), (green, JJ), (~, NN), (darker,...","[0, 0, 0, 0, 0]","[1, 0, 0, 1, 0]",0,2
190,3421-f,1476490157112,26,speaker,the brighter green ~ the brighter green,model,"[(the, DT), (brighter, JJR), (green, JJ), (~, ...","[0, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 1, 0]",0,2
230,3498-1,1476486559698,16,speaker,the brighter green ~ the brighter green,model,"[(the, DT), (brighter, JJR), (green, JJ), (~, ...","[0, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 1, 0]",0,2
322,0699-d,1476486748930,8,speaker,the most purple one ~ the brighter purple one ...,model,"[(the, DT), (most, RBS), (purple, JJ), (one, C...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]",1,1
450,8080-9,1476490120019,36,speaker,the more muted of the two are similar colors ~...,model,"[(the, DT), (more, RBR), (muted, JJ), (of, IN)...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",0,3


### Write to file

In [66]:
(d.drop(['superlatives', 'comparatives'], axis=1)
 .to_csv("taggedColorMsgs2.csv", index=False))

In [58]:
# Fraction of model messages that contain the negation marker 'not '.
len(d_model[d_model['contents'].str.contains('not ')]) * 1.0 / len(d_model)

0.007874962428614367

In [59]:
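import ast

def join_with_tilde(s):
    """Join an iterable of message strings with ' ~ ' separators."""
    return ' ~ '.join(s)

def join_lemmas_lists(r):
    """Concatenate stringified lemma lists into one stringified list."""
    result = []
    for row in r:
        # literal_eval parses the repr'd list without executing arbitrary code.
        result.extend(ast.literal_eval(row))
    return repr(result)

def join_with_tagged(output, tags):
    """Attach the tag counts to each row of `output`, matching on (gameid, roundNum)."""
    columns = ['gameid', 'roundNum', 'contents', 'numSuper', 'numComp', 'condition']
    return pd.merge(output, tags, on=['gameid', 'roundNum'])[columns]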

In [62]:
joined = join_with_tagged(pd.read_csv('humanOutput/colorReferenceClicks2.csv', escapechar='\\'),
                          pd.read_csv("taggedColorMsgs2.csv", escapechar='\\'))

In [63]:
# Per-row rates of comparatives, negations ('not '), and superlatives
# within each condition.
for condition in ('closer', 'further', 'equal'):
    filtered = joined.query('condition == "%s"' % condition)
    print '%s comp: %s' % (condition, filtered['numComp'].sum() * 1.0 / len(filtered))
    print '%s neg: %s' % (condition, len(filtered[filtered['contents'].str.contains('not ')]) * 1.0 / len(filtered))
    print '%s super: %s' % (condition, filtered['numSuper'].sum() * 1.0 / len(filtered))

closer comp: 0.136145192901
closer neg: 0.106881091095
closer super: 0.167825596692
further comp: 0.135046567782
further neg: 0.0807174887892
further super: 0.0524893641486
equal comp: 0.0186192533648
equal neg: 0.0234372571571
equal super: 0.0186192533648
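
The merge and the per-condition rates can also be computed in one pass with `groupby`. The sketch below assumes the same column names as above but runs on two small, entirely hypothetical tables (`clicks` and `tags`), so the numbers it prints are illustrative only and unrelated to the results reported here.

```python
import pandas as pd

# Hypothetical toy tables; only the column names mirror the notebook.
clicks = pd.DataFrame({
    'gameid': ['g1', 'g1', 'g2', 'g2'],
    'roundNum': [1, 2, 1, 2],
    'condition': ['closer', 'equal', 'closer', 'further'],
})
tags = pd.DataFrame({
    'gameid': ['g1', 'g1', 'g2', 'g2'],
    'roundNum': [1, 2, 1, 2],
    'contents': ['the darker one', 'purple',
                 'not the bright one', 'the most green'],
    'numSuper': [0, 0, 0, 1],
    'numComp': [1, 0, 0, 0],
})

# Attach the condition of each round to its tagged message.
joined = pd.merge(clicks, tags, on=['gameid', 'roundNum'])

# 1 if the message contains the negation marker 'not ', else 0.
joined['neg'] = joined['contents'].str.contains('not ').astype(int)

# The mean of a count column within a group equals sum / number of rows,
# which is the per-row rate computed in the loop above.
rates = joined.groupby('condition')[['numComp', 'neg', 'numSuper']].mean()
print(rates)
```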
