# Data Extraction for SEMRYE project

Extract rhyming word pairs from poems. Currently supports the Shakespeare and Housman collections available at https://github.com/sravanareddy/rhymedata, courtesy of Sravana Reddy.


## Output Format

The output is a list of tuples:

> (word1:str, word2:str, rhyme pattern:str, rhyme letter:str, poem title:str, section id:int, author:str)

**section id** starts from 0 within each poem.

For example,

**Input:**

>TITLE A Lover's Complaint

>RHYME a b a b b c c

>From off a hill whose concave womb reworded

>A plaintful story from a sistering vale,

>My spirits to attend this double voice accorded,

>And down I laid to list the sad-tuned tale;

>Ere long espied a fickle maid full pale, 

>Tearing of papers, breaking rings a-twain,

>Storming her world with sorrow's wind and rain.

> ...

For the full scheme of the data, see: https://github.com/sravanareddy/rhymedata

**Output:**

*(reworded, accorded, ababbcc, a, A Lover's Complaint, 0, William Shakespeare)*

*(vale, tale, ababbcc, b, A Lover's Complaint, 0, William Shakespeare)*

*(vale, pale, ababbcc, b, A Lover's Complaint, 0, William Shakespeare)*

*(tale, pale, ababbcc, b, A Lover's Complaint, 0, William Shakespeare)*


...

**Note0** We may also want to keep a mapping **section_map**:
- key: triple (poem title <str>, section id <int>, author <str>)
- value: (section text, rhyme pattern)



**Note1** For now, if a rhyme involves more than two words, like the **b** rhyme above, which has 3 words (*vale*, *tale*, and *pale*), we extract all pairs while keeping the order of appearance, i.e. *(vale, tale)*, *(vale, pale)* and *(tale, pale)* for the example.
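
The pairing rule of Note1 can be sketched with the standard library as follows. The helper name `ordered_pairs` is only illustrative and is not part of the project code.

```python
from itertools import combinations

def ordered_pairs(words):
    """Return every pair (w_i, w_j) with i < j, in order of appearance."""
    return list(combinations(words, 2))

# The three words of the b rhyme in the example section.
pairs = ordered_pairs(["vale", "tale", "pale"])
# pairs == [("vale", "tale"), ("vale", "pale"), ("tale", "pale")]
```

A rhyme with n words yields n(n-1)/2 pairs, so a rhyme with a single word yields none.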

**Note2** The original data uses a special shorthand, described there as:
> Sometimes, we use a shorthand for the rhyme scheme, like 

>RHYME a a *

>This denotes the rhyme scheme aabbccdd...

We expand such shorthand patterns into the full scheme, one letter per line of the section.
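
A minimal sketch of this expansion, assuming the unit before `*` repeats a single letter (as in `a a *`) and that the section length is a multiple of the unit length. The function name `expand_shorthand` is hypothetical.

```python
def expand_shorthand(scheme, line_count):
    """Expand e.g. ['a', 'a', '*'] into a full scheme covering line_count lines."""
    if not scheme or scheme[-1] != '*':
        return list(scheme)               # already a full scheme
    unit = scheme[:-1]
    if line_count % len(unit) != 0:
        raise ValueError("line count is not a multiple of the rhyme unit")
    groups = line_count // len(unit)      # number of rhyme groups in the section
    base = ord(unit[0])
    # group 0 keeps the first letter, group 1 takes the next letter, and so on
    return [chr(base + g) for g in range(groups) for _ in unit]

full = ''.join(expand_shorthand('a a *'.split(), 8))
# full == 'aabbccdd'
```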




## Code Design

### Format of input data
The format of the original data is quite regular.
Each file is dedicated to one author, who is specified in the first line.

**poem structure**:
>an empty line

>title

>sections*

**section structure**:

>empty line

>rhyme pattern

>empty line

>poem line*

> **UPDATE** It is NOT true that each section starts with a rhyme pattern. Consecutive sections may share the same pattern, which is then specified only before the first of them. See line 734 of shakespeare.txt for an example.

Poem lines are consecutive lines with content.
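
The structure above, including the carried-over rhyme pattern, can be sketched as a small generator. This is only an illustration under the stated format: `iter_sections` is a hypothetical name, and the sample lines are made up rather than taken from the data.

```python
def iter_sections(lines):
    """Yield (title, rhyme, section_lines) from the lines after the AUTHOR line."""
    title, rhyme, current = None, None, []
    for line in list(lines) + [""]:       # trailing "" flushes the last section
        tokens = line.split()
        if not tokens:                    # an empty line closes a section
            if current:
                yield title, rhyme, current
                current = []
        elif tokens[0] == "TITLE":
            title = " ".join(tokens[1:])
        elif tokens[0] == "RHYME":
            rhyme = tokens[1:]            # kept until the next RHYME line
        else:
            current.append(tokens)        # a poem line, split into words

sample = [
    "", "TITLE Demo", "", "RHYME a a", "",
    "first line one", "second line done", "",
    "third line two", "fourth line through",
]
sections = list(iter_sections(sample))
# two sections, both with rhyme ['a', 'a']
```

Because `rhyme` is never reset at the end of a section, the second section reuses the pattern given before the first one.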

**Rhyme is at the last word of a poem line**
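
A minimal sketch of picking the rhyme word from a line, assuming that only trailing punctuation has to be removed. `rhyme_word` is an illustrative name.

```python
import string

def rhyme_word(line):
    """Return the last word of a poem line without trailing punctuation."""
    tokens = line.split()
    return tokens[-1].rstrip(string.punctuation) if tokens else ""

w1 = rhyme_word("And down I laid to list the sad-tuned tale;")   # 'tale'
w2 = rhyme_word("Tearing of papers, breaking rings a-twain,")    # 'a-twain'
```

Only the end of the word is stripped, so the hyphen inside *a-twain* is kept.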


### Format of internal data

(discarded) **collections**: a list of **poem**, which is a tuple of (**poem title**, **list_of_section**)
(discarded) **list_of_section**: a list of **section**, each of which is a tuple (**section id**, **section text**, line_count)

#### Two outputs 

1. **word_pair_list**: a list of word pairs, each of the format **(word1 <str>, word2 <str>, rhyme pattern <str>, rhyme letter <str>, poem title <str>, section id <int>, author <str>)**

2. **section_map**: a dictionary. key: (**poem title**, **section id**)  value: tuple(**section text** <list of lists of strings, i.e. each line split into words>, **rhyme pattern** <string>)
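
The two outputs are linked through the poem title and section id. The sketch below shows how a record of **word_pair_list** could be traced back to its section. The helper `section_of` is hypothetical, and the section text is shortened to two lines for illustration.

```python
# One record in the word_pair_list format.
record = ("vale", "tale", "ababbcc", "b",
          "A Lover's Complaint", 0, "William Shakespeare")

# A section_map holding a single, shortened section (illustration only).
section_map = {
    ("A Lover's Complaint", 0): (
        [["A", "plaintful", "story", "from", "a", "sistering", "vale,"],
         ["And", "down", "I", "laid", "to", "list", "the", "sad-tuned", "tale;"]],
        "ababbcc",
    ),
}

def section_of(record, section_map):
    """Look up the (section text, rhyme pattern) a word pair was taken from."""
    poem_title, section_id = record[4], record[5]
    return section_map[(poem_title, section_id)]

text, pattern = section_of(record, section_map)
```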


### Functions
No internal state is needed, so the code uses plain functions instead of a class.


def **poem_extract(poem_file)**:
    1. get the *author_name* string
    
    loop over the poems
        get the *poem title*
        maintain *section_count*
        
        loop over the sections
            get the *rhyme string* (if a new one is given)
            get *section text* and line count
            
            call **parse_rhyme**, get rhyme pattern and index pairs
            add entry to *section_map*
            call **pair_extraction**, get word pair tuples, insert into *word_pair_list*
            

    return  word_pair_list, section_map


def **parse_rhyme(rhyme string, line_count)**:
    
    return  new_rhyme_str, index_pair_list

Format of index_pair_list: (index1, index2, rhyme_letter)
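
As an illustration of this format, the index pairs for a fully expanded pattern could be computed as in the following sketch. `index_pairs` is a hypothetical helper, separate from the notebook code.

```python
from collections import defaultdict
from itertools import combinations

def index_pairs(pattern):
    """Return sorted (index1, index2, rhyme_letter) triples for a pattern string."""
    positions = defaultdict(list)
    for i, letter in enumerate(pattern):
        positions[letter].append(i)       # line indices sharing a rhyme letter
    pairs = [(i, j, letter)
             for letter, indices in positions.items()
             for i, j in combinations(indices, 2)]
    return sorted(pairs)

result = index_pairs("ababbcc")
# result == [(0, 2, 'a'), (1, 3, 'b'), (1, 4, 'b'), (3, 4, 'b'), (5, 6, 'c')]
```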


def **pair_extraction(index_pair_list, section_text)** :
    return raw_pair_list
    

    
Format of each pair in raw_pair_list:  (word1, word2, letter)



In [34]:
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import print_function   # try to be python3 compatible, but it does not always work.
import string
import codecs #might be useful for dealing with poems in other languages
import cPickle as pickle

def poem_extract(poem_file):
    
    # two output data structure
    section_map = {}
    word_pair_list = []
    
    debug = False
    
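    #with open(poem_file, 'r') as f: # sufficient for pure ASCII input
    with codecs.open(poem_file, 'rU','utf-8') as f:  # utf-8, so that non-ASCII languages can be handled too
        lines = f.readlines()
        author_line = lines[0].split()
        # the appended empty line makes sure the last section is recorded
        # even if the file does not end with an empty line
        lines = lines[1:] + ['']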
        
    assert author_line[0]== 'AUTHOR'
    author_str = " ".join(author_line[1:])
    
    seen_text =   False
    #seen_rhyme = False
    section_text = []
    section_count = 0
    rhyme, poem_title = None, None
    
    total_line_count, total_section_count, total_title_count = 0, 0, 0
    
    for line_id, line in enumerate(lines):
        
        #print ('line num', line_id+2, ':', line.strip() ) 
        
        if line.split()==[]:
            #seen_rhyme = False
            if seen_text:
                
                assert  len(section_text) > 0
                assert rhyme
                assert poem_title
                
                total_section_count += 1                
              
                section_line_count = len(section_text)
                rhyme_str, index_pair_list = parse_rhyme(rhyme, section_line_count) # parse rhyme pattern
                
                section_map[(poem_title, section_count)] = (section_text, rhyme_str) # update section_map
                
                raw_pair_list = pair_extraction(index_pair_list, section_text) # extract word pairs in the rhyme
                
                extended_pair_list = [ (word1 , word2 , rhyme_str , rhyme_letter , poem_title , section_count , author_str ) for word1, word2, rhyme_letter in raw_pair_list ]
                word_pair_list.extend(extended_pair_list) # update word_pair_list
        
                if debug:
                    for word1, word2, rhyme_letter in raw_pair_list:
                        print ('record:', word1, word2, rhyme_letter)
        
                section_count += 1
                section_text = []
                section_line_count = 0
                
                # the rhyme must NOT be reset here: two consecutive sections may share a single RHYME line, see line 724 in shakespeare.txt
                #rhyme = None
                #seen_rhyme = False
                seen_text = False
                
                
                
        
        elif line.split()[0] == 'TITLE':
            poem_title = " ".join(line.split()[1:])
            #seen_title = True           
            total_title_count += 1
            section_count = 0
            
        
        #elif  not seen_rhyme:  # the old logic does not hold, see line 724 in shakespeare.txt, where two consecutive sections share a single RHYME line
            #assert line.split()[0] == 'RHYME'
        elif   line.split()[0] == 'RHYME':   
            rhyme = line.split()[1:]
            #seen_rhyme = True 
            if debug: print (rhyme)
        
        else:
            seen_text = True
            section_text.append(line.split())
            total_line_count += 1
    
    
    
    
    print ('\n\nDone! total line count/section count/title count=', total_line_count, total_section_count, total_title_count)
    
    return word_pair_list, section_map

        
            
#
# auxiliary function: all pairs (x_i, x_j) with i < j from a list of items,
# in order of appearance
#
def get_partial_order_pair_list(list_of_item):
    partial_list = []
    for i, item in enumerate(list_of_item[:-1]):
        for j in range(i+1, len(list_of_item)):
            partial_list.append((item, list_of_item[j]))

    return partial_list            
            

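def parse_rhyme(rhyme, section_line_count):
    # expand the shorthand 'a a *' into aabbccdd... over the whole section
    if rhyme[-1] == '*':
        unit = rhyme[:-1]
        assert len(set(unit))==1   # the unit must repeat a single letter
        base = ord(unit[0])
        
        assert section_line_count%len(unit) == 0
        
        # integer division, so that range() also accepts the result under Python 3
        repeat = (section_line_count- len(unit))//len(unit)
        rhyme =[]
        rhyme.extend(unit)
        for i in range(repeat):
            for j in range(len(unit)):
                rhyme.append(chr(base+i+1))
    
    # one rhyme letter per line of the section
    assert len(rhyme) == section_line_count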
    
    letter2index = {}
    for i, letter in enumerate(rhyme):
        letter2index.setdefault(letter, []).append(i)
    #return letter2index
    
    #Format of index_pair_list: (index1, index2, rhyme_letter)
    index_pair_list= []
    
    for rhyme_letter in letter2index:
        index_list = letter2index[rhyme_letter]
        partial_list = get_partial_order_pair_list(index_list)

        for pair in partial_list:
            index1, index2 = pair
            index_pair_list.append((index1, index2, rhyme_letter))


    index_pair_list.sort() 
    new_rhyme_str = ''.join(rhyme)

    return new_rhyme_str, index_pair_list


def remove_tail_punctuation(word):
    # strip all trailing punctuation, e.g. 'tale;' -> 'tale';
    # also safe for empty or punctuation-only words
    return word.rstrip(string.punctuation)


def pair_extraction(index_pair_list, section_text):
    
    debug = False
    # section_text are list of words (str)
    
    raw_pair_list = []
    #Format of each pair in raw_pair_list: (word1, word2, letter)
    
    list_of_last_word = [remove_tail_punctuation(sent[-1]) for sent in section_text]
    

    if debug: print ('list of last_word', list_of_last_word)
    
    for index1, index2, rhyme_letter in index_pair_list:
        raw_pair_list.append( (list_of_last_word[index1], list_of_last_word[index2], rhyme_letter))
    
    
    return raw_pair_list

    
    


In [66]:
#
# UNIT TEST for parse_rhyme()
#

r=get_partial_order_pair_list([1, 3, 4])
print ('sample partial list = ', r)


a=parse_rhyme('a b a b b c c'.split(), 7)
b=parse_rhyme('a a *'.split(), 8)
print (a, b)

sample partial list =  [(1, 3), (1, 4), (3, 4)]
('ababbcc', [(0, 2, 'a'), (1, 3, 'b'), (1, 4, 'b'), (3, 4, 'b'), (5, 6, 'c')]) ('aabbccdd', [(0, 1, 'a'), (2, 3, 'b'), (4, 5, 'c'), (6, 7, 'd')])


In [10]:
#
# UNIT TEST for pair_extraction() 
#

section_text = '''From off a hill whose concave womb reworded
A plaintful story from a sistering vale,
My spirits to attend this double voice accorded,
And down I laid to list the sad-tuned tale;
Ere long espied a fickle maid full pale, 
Tearing of papers, breaking rings a-twain,
Storming her world with sorrow's wind and rain.
'''

with open('tmp','w') as f:
    f.write(section_text)

with open('tmp','r') as f:
    text = f.readlines()
    section_text = [l.split() for l in text]
print (section_text)


new_rhyme_str, index_pair_list = parse_rhyme ('a b a b b c c'.split() ,7)
print ('index_pair_list:', index_pair_list)

raw_pair_list = pair_extraction(index_pair_list, section_text)

print ('\nRaw pair list:')
for i in raw_pair_list:
    print (i)

[['From', 'off', 'a', 'hill', 'whose', 'concave', 'womb', 'reworded'], ['A', 'plaintful', 'story', 'from', 'a', 'sistering', 'vale,'], ['My', 'spirits', 'to', 'attend', 'this', 'double', 'voice', 'accorded,'], ['And', 'down', 'I', 'laid', 'to', 'list', 'the', 'sad-tuned', 'tale;'], ['Ere', 'long', 'espied', 'a', 'fickle', 'maid', 'full', 'pale,'], ['Tearing', 'of', 'papers,', 'breaking', 'rings', 'a-twain,'], ['Storming', 'her', 'world', 'with', "sorrow's", 'wind', 'and', 'rain.']]
index_pair_list: [(0, 2, 'a'), (1, 3, 'b'), (1, 4, 'b'), (3, 4, 'b'), (5, 6, 'c')]

Raw pair list:
('reworded', 'accorded', 'a')
('vale', 'tale', 'b')
('vale', 'pale', 'b')
('tale', 'pale', 'b')
('a-twain', 'rain', 'c')


In [11]:
#
# Run the whole program for Shakespeare
#
path_shake = '../data/shakespeare.txt'
word_pair_list_shakes, section_map_shakes =  poem_extract(path_shake)

wp_path = '../working_data/shakes.pair.pkl'
sm_path = '../working_data/shakes.map.pkl'

with open(wp_path, 'wb') as f:
    pickle.dump(word_pair_list_shakes, f)
print ('Word pair list have been pickled to ', wp_path)

with open(sm_path, 'wb') as f:
    pickle.dump(section_map_shakes, f)
print ('Section map have been pickled to ', sm_path)





Done! total line count/section count/title count= 5949 722 24
Word pair list have been pickled to  ../working_data/shakes.pair.pkl
Section map have been pickled to  ../working_data/shakes.map.pkl


In [97]:
#
# Inspect the results for Shakespeare
#

for i in word_pair_list_shakes [:20]:
    print (i)

k_list = list(section_map_shakes.keys())[8:10]  # list() keeps the slice valid under Python 3 as well
for k in k_list:
    print (k, ':', section_map_shakes[k])

('reworded', 'accorded', 'ababbcc', 'a', "A Lover's Complaint", 0, 'William Shakespeare')
('vale', 'tale', 'ababbcc', 'b', "A Lover's Complaint", 0, 'William Shakespeare')
('vale', 'pale', 'ababbcc', 'b', "A Lover's Complaint", 0, 'William Shakespeare')
('tale', 'pale', 'ababbcc', 'b', "A Lover's Complaint", 0, 'William Shakespeare')
('a-twain', 'rain', 'ababbcc', 'c', "A Lover's Complaint", 0, 'William Shakespeare')
('straw', 'saw', 'ababbcc', 'a', "A Lover's Complaint", 1, 'William Shakespeare')
('sun', 'done', 'ababbcc', 'b', "A Lover's Complaint", 1, 'William Shakespeare')
('sun', 'begun', 'ababbcc', 'b', "A Lover's Complaint", 1, 'William Shakespeare')
('done', 'begun', 'ababbcc', 'b', "A Lover's Complaint", 1, 'William Shakespeare')
('rage', 'age', 'ababbcc', 'c', "A Lover's Complaint", 1, 'William Shakespeare')
('eyne', 'brine', 'ababbcc', 'a', "A Lover's Complaint", 2, 'William Shakespeare')
('characters', 'tears', 'ababbcc', 'b', "A Lover's Complaint", 2, 'William Shakespeare'

In [35]:
#
# Run the whole program for Housman
#

path_housman = '../data/housman.txt'
word_pair_list_housman, section_map_housman =  poem_extract(path_housman)

wp_path = '../working_data/housman.pair.pkl'
sm_path = '../working_data/housman.map.pkl'

with open(wp_path, 'wb') as f:
    pickle.dump(word_pair_list_housman, f)
print ('Word pair list have been pickled to ', wp_path)

with open(sm_path, 'wb') as f:
    pickle.dump(section_map_housman, f)
print ('Section map have been pickled to ', sm_path)



Done! total line count/section count/title count= 3215 659 177
Word pair list have been pickled to  ../working_data/housman.pair.pkl
Section map have been pickled to  ../working_data/housman.map.pkl


In [13]:
# Shakespeare has far more sections per poem and more lines per section than Housman
print ('shakespeare:',722/24.0, 5949/722.0)
print ('housman',659/177.0, 3215/659.0)


# import shakespeare pickles...
word_pair_shakes = pickle.load(open('../working_data/shakes.pair.pkl', 'rb'))
for i in word_pair_shakes [:20]:
    print (i)

shakespeare: 30.0833333333 8.23961218837
housman 3.72316384181 4.87860394537
(u'reworded', u'accorded', u'ababbcc', u'a', u"A Lover's Complaint", 0, u'William Shakespeare')
(u'vale', u'tale', u'ababbcc', u'b', u"A Lover's Complaint", 0, u'William Shakespeare')
(u'vale', u'pale', u'ababbcc', u'b', u"A Lover's Complaint", 0, u'William Shakespeare')
(u'tale', u'pale', u'ababbcc', u'b', u"A Lover's Complaint", 0, u'William Shakespeare')
(u'a-twain', u'rain', u'ababbcc', u'c', u"A Lover's Complaint", 0, u'William Shakespeare')
(u'straw', u'saw', u'ababbcc', u'a', u"A Lover's Complaint", 1, u'William Shakespeare')
(u'sun', u'done', u'ababbcc', u'b', u"A Lover's Complaint", 1, u'William Shakespeare')
(u'sun', u'begun', u'ababbcc', u'b', u"A Lover's Complaint", 1, u'William Shakespeare')
(u'done', u'begun', u'ababbcc', u'b', u"A Lover's Complaint", 1, u'William Shakespeare')
(u'rage', u'age', u'ababbcc', u'c', u"A Lover's Complaint", 1, u'William Shakespeare')
(u'eyne', u'brine', u'ababbcc', u

Working with embeddings requires DISSECT (https://github.com/composes-toolkit/dissect). We use its dense matrix format (`dm`), in which each line holds a word followed by its numeric representation. We use the 300-dimensional pre-trained GloVe embeddings from http://nlp.stanford.edu/projects/glove/.
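
To make the dense format concrete, the sketch below parses lines of the form "word v1 v2 ..." into a dictionary without using DISSECT. It assumes whitespace-separated values. The function name `read_dense_embeddings` and the 3-dimensional toy vectors are made up for illustration.

```python
def read_dense_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Toy input standing in for a real embeddings file (made-up values).
toy = ["vale 0.1 -0.2 0.3", "tale 0.0 0.5 -0.1", ""]
emb = read_dense_embeddings(toy)
```

With a real file, the lines would come from iterating over the opened embeddings file instead of the `toy` list.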

In [10]:
# load embedding space
from composes.semantic_space.space import Space
glove_embeddings_file = '../data/embeddings/glove.6B.300d_dense.emb'
print("==> loading embeddings from " + glove_embeddings_file)
glove_word_space = Space.build(data=glove_embeddings_file, format='dm')
print("==> loaded.")

==> loading embeddings from ../data/embeddings/glove.6B.300d_dense.emb
==> loaded.


In [28]:
# extract the embeddings for the selected word pairs
def extract_embeddings(embeddings_space, word_pairs):
    # a set gives constant-time membership tests (the row list can be very large)
    vocabulary = set(embeddings_space.get_id2row())
    existing_pairs_list = []
    for wp in word_pairs:
        word1, word2 = wp[0].encode('utf8'), wp[1].encode('utf8')
        # keep the pair only if both words have an embedding
        if word1 in vocabulary and word2 in vocabulary:
            w1_emb = embeddings_space.get_row(word1).mat
            w2_emb = embeddings_space.get_row(word2).mat
            # append both embeddings to the original word pair tuple
            existing_pairs_list.append(tuple(wp) + (w1_emb, w2_emb))
                
    print("==> "+ str(len(existing_pairs_list)) +" representations found out of " + str(len(word_pairs)))
    return existing_pairs_list

In [38]:
import cPickle as pickle

# extract GloVe representations for the Shakespeare data
word_pair_shakes = pickle.load(open('../working_data/shakes.pair.pkl', 'rb'))
glove_shakespeare_wp = extract_embeddings(glove_word_space, word_pair_shakes)
glove_shakespeare_wp_path = '../working_data/shakes.pair.glove.pkl'

with open(glove_shakespeare_wp_path, 'wb') as f:
    pickle.dump(glove_shakespeare_wp, f)
print ('Word pair list with representations have been pickled to ', glove_shakespeare_wp_path)

==> 3147 representations found out of 3454
Word pair list with representations have been pickled to  ../working_data/shakes.pair.glove.pkl


In [37]:
import cPickle as pickle

# extract GloVe representations for the Housman data
word_pair_housman = pickle.load(open('../working_data/housman.pair.pkl', 'rb'))
glove_housman_wp = extract_embeddings(glove_word_space, word_pair_housman)
glove_housman_wp_path = '../working_data/housman.pair.glove.pkl'

with open(glove_housman_wp_path, 'wb') as f:
    pickle.dump(glove_housman_wp, f)
print ('Word pair list with representations have been pickled to ', glove_housman_wp_path)

==> 1407 representations found out of 1548
Word pair list with representations have been pickled to  ../working_data/housman.pair.glove.pkl
