In [48]:
import xml
import xml.etree.ElementTree as ET
from itertools import islice
from pprint import pprint

## PostHistory

Here is an XML file with some PostHistory entries from Meta:Outdoors. The "tree" is an `ElementTree` object representing the whole document, and the root is its top-level element, the ancestor of all other nodes.
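
As a minimal sketch of the tree/root distinction, the snippet below parses a tiny made-up document with the shape the dump appears to use: one top-level element holding `row` children. The `posthistory` tag name, the `sample_*` variable names, and the attribute values are illustrative assumptions, not taken from the real file.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Hypothetical miniature of PostHistory.xml; tag names and values are assumed.
sample = b"""<posthistory>
  <row Id="1" PostHistoryTypeId="2" PostId="10" />
  <row Id="2" PostHistoryTypeId="10" PostId="10" Comment="101" />
</posthistory>"""

sample_tree = ET.parse(BytesIO(sample))  # ElementTree: wraps the whole document
sample_root = sample_tree.getroot()      # Element: the single top-level node

root_tag = sample_root.tag               # tag of the top-level element
row_count = len(sample_root)             # number of direct children of the root
row_ids = [row.attrib['Id'] for row in sample_root]  # iterating yields children
```

Iterating over an element yields only its direct children, which is enough here because every `row` sits directly under the root.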

The closest thing I can find to a schema is documented here: http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

Also see: http://data.stackexchange.com/stackoverflow/query/36599/show-all-types

In [4]:
tree = ET.parse("./PostHistory.xml")
root = tree.getroot()

PostHistoryTypeId 10 means Closed. The Comment of a PostHistory close event should be a number identifying a CloseReasonType. CloseReasonType 1 and 101 both mean "duplicate" (1 in an old schema, 101 in a new schema).

In [39]:
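close_code = '10'               # PostHistoryTypeId: Closed
edit_body_code = '5'            # PostHistoryTypeId: Edit Body (used further down)
duplicate_codes = ('1', '101')  # CloseReasonType: duplicate (old and new schema)
voted_to_close = root.findall(".//row[@PostHistoryTypeId='{0}']".format(close_code))
for entry in voted_to_close:
    # .get() avoids a KeyError if a close event has no Comment attribute
    close_reason = entry.attrib.get('Comment')
    if close_reason in duplicate_codes:
        # Single-argument print() calls behave the same in Python 2 and 3
        print("PostHistory ID: {0}".format(entry.attrib['Id']))
        print(entry.attrib['Text'])
        print(entry.attrib['RevisionGUID'])
        print(entry.attrib['PostId'])
        print(list(entry.attrib.keys()))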

PostHistory ID: 361
{"OriginalQuestionIds":[165],"Voters":[{"Id":66,"DisplayName":"Rory Alsop"},{"Id":18,"DisplayName":"Kevin"}]}
2ec173a0-4a3e-48cb-a742-85c24a206e26
178
['Comment', 'PostId', 'UserId', 'PostHistoryTypeId', 'Text', 'RevisionGUID', 'CreationDate', 'Id']
PostHistory ID: 1358
{"OriginalQuestionIds":[572],"Voters":[{"Id":2653,"DisplayName":"Wills"},{"Id":-1,"DisplayName":"Community","BindingReason":{"DuplicateApprovedByAsker":""}}]}
c4e8c84c-af50-4785-9a93-88be5629181c
602
['Comment', 'PostId', 'UserId', 'PostHistoryTypeId', 'Text', 'RevisionGUID', 'CreationDate', 'Id']


For a PostHistory event that closes a post as a duplicate, the `Text` attribute is itself a JSON-formatted string describing the vote to close: `OriginalQuestionIds` lists the question(s) it duplicates and `Voters` lists the users who voted. It is unclear where other information relevant to the closing event is stored.
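
A minimal sketch of decoding that payload with the standard `json` module, using the `Text` value of PostHistory Id 361 copied from the output above. The helper name `parse_close_text` is hypothetical, and the `.get()` defaults are a precaution: whether any real rows lack these keys is not known here.

```python
import json

# Text attribute of PostHistory Id 361, copied from the output above.
text = ('{"OriginalQuestionIds":[165],"Voters":[{"Id":66,"DisplayName":"Rory Alsop"},'
        '{"Id":18,"DisplayName":"Kevin"}]}')

def parse_close_text(raw):
    """Return (original question ids, voter ids) from a duplicate-close Text payload."""
    payload = json.loads(raw)
    # Fall back to empty lists if a key is missing (an assumed precaution).
    originals = payload.get('OriginalQuestionIds', [])
    voters = [voter['Id'] for voter in payload.get('Voters', [])]
    return originals, voters

originals, voter_ids = parse_close_text(text)
```

In the notebook itself, the same helper could be applied to `entry.attrib['Text']` inside the loop over `voted_to_close`.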

There is, for instance, an event created by the Community user (UserId -1) that adds the "Possible Duplicate" boilerplate to the post:

In [49]:
add_boilerplate = root.findall(".//row[@PostHistoryTypeId='{0}'][@Comment='insert duplicate link']".format(edit_body_code))
pprint(add_boilerplate[0].attrib)

{'Comment': 'insert duplicate link',
 'CreationDate': '2012-02-29T23:07:20.757',
 'Id': '360',
 'PostHistoryTypeId': '5',
 'PostId': '178',
 'RevisionGUID': '367d951f-bc0d-4f78-9065-abafc8b3036b',
 'Text': "> **Possible Duplicate:**  \n> [How can we drive more questions on the site?](http://meta.outdoors.stackexchange.com/questions/165/how-can-we-drive-more-questions-on-the-site)  \n\n<!-- End of automatically inserted text -->\n\nIt seems that we had a great start on this beta but now the activity is really dropping. Is this natural or something to be worried about, and if so what's the best course of action to try and get the site back on track?",
 'UserId': '-1'}


For this post, that edit appears to be triggered by the initial vote to close: both events concern PostId 178, and the boilerplate edit (Id 360) immediately precedes the close event (Id 361). It is also theoretically possible for a post to be *flagged* as a duplicate by a different mechanism, which conceivably would show up not in the PostHistory but wherever the flag histories are kept.
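
One way to check whether the boilerplate edit and the close event travel together more generally is to pair them by `PostId`. The sketch below does this on stand-in rows built in memory, not on the real file. The names `make_row`, `pair_duplicate_events`, and `sample_root`, the `posthistory` root tag, and the third row are hypothetical. `Comment='1'` on row 361 is also an assumption, since the output above only shows that its Comment was one of the duplicate codes.

```python
import xml.etree.ElementTree as ET

def make_row(**attrib):
    """Build a stand-in <row> element (hypothetical helper for this sketch)."""
    return ET.Element('row', attrib)

# Trimmed copies of the two events shown above, plus an unrelated made-up row.
sample_root = ET.Element('posthistory')  # root tag name is an assumption
sample_root.extend([
    make_row(Id='360', PostHistoryTypeId='5', PostId='178',
             Comment='insert duplicate link'),
    make_row(Id='361', PostHistoryTypeId='10', PostId='178', Comment='1'),  # Comment assumed
    make_row(Id='999', PostHistoryTypeId='5', PostId='42', Comment='fixed typo'),
])

def pair_duplicate_events(root):
    """Map PostId -> (boilerplate edit Id, close event Id) where both exist."""
    edits = {}
    closes = {}
    for row in root.iter('row'):
        kind = row.get('PostHistoryTypeId')
        comment = row.get('Comment')
        # If a post has several matching events, the last one seen wins.
        if kind == '5' and comment == 'insert duplicate link':
            edits[row.get('PostId')] = row.get('Id')
        elif kind == '10' and comment in ('1', '101'):
            closes[row.get('PostId')] = row.get('Id')
    return {post: (edits[post], closes[post]) for post in edits if post in closes}

pairs = pair_duplicate_events(sample_root)
```

Run against the real `root`, posts that appear in only one of the two groups would be the interesting cases, since they would suggest the boilerplate edit and the close event are not always created together.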