In [1]:
import xml
import xml.etree.ElementTree as ET
from itertools import islice
from pprint import pprint

## PostHistory

Here is an XML file with PostHistory entries from the Sports Stack Exchange data dump. The `tree` object represents the whole parsed document, and `root` is its top-level element, the ancestor of all other nodes.
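The tree/root distinction can be seen on a tiny inline document. This is only a sketch: `SAMPLE_XML` is a hypothetical two-row stand-in for PostHistory.xml, its attribute values are made up, and the `posthistory` root tag is an assumption about how the dump is laid out.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for PostHistory.xml; the attribute values are illustrative only.
SAMPLE_XML = """<posthistory>
  <row Id="1" PostHistoryTypeId="2" PostId="10" />
  <row Id="2" PostHistoryTypeId="10" PostId="10" Comment="101" />
</posthistory>"""

sample_tree = ET.ElementTree(ET.fromstring(SAMPLE_XML))  # wraps the whole document
sample_root = sample_tree.getroot()                      # the top-level element

root_tag = sample_root.tag
row_ids = [row.get("Id") for row in sample_root]  # iterating yields direct children
closed = sample_root.findall(".//row[@PostHistoryTypeId='10']")  # attribute predicate
print(root_tag, row_ids, [row.get("Id") for row in closed])
```

Iterating over an element yields its direct children, while `findall` with a `.//row[@attr='value']` path searches all descendants that match the attribute predicate.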

The closest thing I can find to a schema is documented here: http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

Also see: http://data.stackexchange.com/stackoverflow/query/36599/show-all-types

In [2]:
url = "https://archive.org/download/stackexchange/sports.stackexchange.com.7z"
url2 = "https://ia800500.us.archive.org/22/items/stackexchange/sports.stackexchange.com.7z"
file1 = "sports.stackexchange.com.7z"
folder = "SportsExchangeData"

Download the 7z archive to the current working directory:
https://archive.org/download/stackexchange/sports.stackexchange.com.7z

In [3]:
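try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# Returns a (local filename, response headers) tuple.
urlretrieve(url2, file1)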

('sports.stackexchange.com.7z', <httplib.HTTPMessage instance at 0x104054d40>)

Unzip the archive and put its contents in a folder called "SportsExchangeData" in the current directory:

In [4]:
import subprocess
# Pass the arguments as a list so no shell is needed; "-o<dir>" sets the output folder.
cmd = ["/usr/local/bin/7z", "x", file1, "-o" + folder]
subprocess.call(cmd)

134
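The number printed after the cell is the exit status returned by `subprocess.call`. By convention 0 means success and any other value signals a problem, so it can be worth checking before parsing the extracted files. A minimal sketch, assuming a hypothetical helper `run_checked` (not part of the notebook), with a child Python process standing in for `7z` so the example runs anywhere:

```python
import subprocess
import sys

def run_checked(args):
    """Run a command given as an argument list; raise if it exits nonzero."""
    status = subprocess.call(args)
    if status != 0:
        raise RuntimeError("command {0!r} exited with status {1}".format(args, status))
    return status

# Stand-in commands: a child Python process that exits with a chosen status.
ok_status = run_checked([sys.executable, "-c", "import sys; sys.exit(0)"])
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
    failure_message = None
except RuntimeError as err:
    failure_message = str(err)
```

With the notebook's variables, the equivalent call would be `run_checked(["/usr/local/bin/7z", "x", file1, "-o" + folder])`.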

In [5]:
tree = ET.parse(folder +"/PostHistory.xml")
root = tree.getroot()

A PostHistoryTypeId of 10 means the post was closed. For a close event, the Comment attribute should hold the numeric CloseReasonType ID. CloseReasonType 1 and 101 both mean "duplicate" (1 in the old schema, 101 in the new one).

In [6]:
close_code = '10'               # PostHistoryTypeId: post closed
edit_body_code = '5'            # PostHistoryTypeId: body edited
duplicate_codes = ('1', '101')  # CloseReasonType: duplicate (old and new schema)
voted_to_close = root.findall(".//row[@PostHistoryTypeId='{0}']".format(close_code))
for entry in voted_to_close:
    close_reason = entry.attrib.get('Comment')  # tolerate rows without a Comment
    if close_reason in duplicate_codes:
        print("PostHistory ID: {0}".format(entry.attrib['Id']))
        print(entry.attrib['Text'])
        print(entry.attrib['RevisionGUID'])
        print(entry.attrib['PostId'])
        print(list(entry.attrib.keys()))

PostHistory ID: 1585
{"OriginalQuestionIds":[179],"Voters":[{"Id":6,"DisplayName":"wax eagle"},{"Id":38,"DisplayName":"Tonny Madsen"}]}
4b43ed3f-76fb-4a67-b953-7c4b45c23eba
562
['Comment', 'PostId', 'UserId', 'PostHistoryTypeId', 'Text', 'RevisionGUID', 'CreationDate', 'Id']
PostHistory ID: 5372
{"OriginalQuestionIds":[1010],"Voters":[{"Id":385,"DisplayName":"Dor Cohen"},{"Id":786,"DisplayName":"SocioMatt"},{"Id":527,"DisplayName":"edmastermind29"}]}
10f12d59-24c8-4d98-ab11-33e33ca79fd8
1676
['Comment', 'PostId', 'UserId', 'PostHistoryTypeId', 'Text', 'RevisionGUID', 'CreationDate', 'Id']
PostHistory ID: 8821
{"OriginalQuestionIds":[2128],"Voters":[{"Id":103,"DisplayName":"Rory Alsop"},{"Id":385,"DisplayName":"Dor Cohen"},{"Id":527,"DisplayName":"edmastermind29"}]}
c4fbc791-7fd2-4922-b7fd-814697b36c1c
2303
['Comment', 'PostId', 'UserId', 'PostHistoryTypeId', 'Text', 'RevisionGUID', 'CreationDate', 'Id']
PostHistory ID: 9774
{"OriginalQuestionIds":[179],"Voters":[{"Id":385,"DisplayName"

For a PostHistory event recording that a post was closed as a duplicate, the `Text` attribute is itself a JSON-formatted string describing the close vote: the original question IDs and the users who voted. It is unclear where other information relevant to the closing event is stored.
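Because `Text` is JSON, it can be decoded with the standard `json` module rather than picked apart as a string. A minimal sketch, using the `Text` value of PostHistory entry 1585 copied from the output above (the variable names are illustrative):

```python
import json

# Text attribute copied from PostHistory entry 1585 in the output above.
text = ('{"OriginalQuestionIds":[179],"Voters":[{"Id":6,"DisplayName":"wax eagle"},'
        '{"Id":38,"DisplayName":"Tonny Madsen"}]}')

vote = json.loads(text)
original_ids = vote["OriginalQuestionIds"]                        # questions this post duplicates
voter_names = [voter["DisplayName"] for voter in vote["Voters"]]  # users who voted to close
print(original_ids, voter_names)
```

In the notebook itself, the same decoding would apply to `entry.attrib['Text']` inside the loop over `voted_to_close`.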

There is, for instance, an event attributed to the Community user (UserId -1) that adds the "Possible Duplicate" boilerplate to the post:

In [7]:
add_boilerplate = root.findall(".//row[@PostHistoryTypeId='{0}'][@Comment='insert duplicate link']".format(edit_body_code))
pprint(add_boilerplate[0].attrib)

{'Comment': 'insert duplicate link',
 'CreationDate': '2012-03-22T14:35:03.113',
 'Id': '1584',
 'PostHistoryTypeId': '5',
 'PostId': '562',
 'RevisionGUID': '8c47ddc6-e49b-4170-9bad-0c0b6c477d58',
 'Text': "> **Possible Duplicate:**  \n> [Why is FIFA against adding instant replay to the game?](http://sports.stackexchange.com/questions/179/why-is-fifa-against-adding-instant-replay-to-the-game)  \n\n<!-- End of automatically inserted text -->\n\nAfter watching England's world cup match i was eagerly waiting for FIFA to introduce the goal line technology. But it hasn't come out yet. \r\n\r\nWhat are the disadvantages of using it? In all other games now days technology is very much used and has resulted in a reduction of refereeing or umpiring errors.  \r\n\r\nWhat has FIFA got against the use of technology in the game? ",
 'UserId': '-1'}


For this post, that edit appears to be triggered by the initial vote to close. It is also theoretically possible for a post to be *flagged* as a duplicate by a different mechanism, which presumably would show up not in the PostHistory but wherever the flag histories are kept.
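One way to probe the relationship between the boilerplate edit and the close event is to group both kinds of rows by `PostId` and see which posts have one without the other. The following is only a sketch under stated assumptions: `pair_duplicate_events` is a hypothetical helper, and `SAMPLE` is a made-up miniature of PostHistory.xml. Rows 1584 and 1585 reuse the IDs shown above, the `Comment="1"` on row 1585 is assumed (it could equally be 101), and row 2000 is invented to show a close with no boilerplate edit.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up miniature of PostHistory.xml (see the assumptions in the text above).
SAMPLE = """<posthistory>
  <row Id="1584" PostHistoryTypeId="5" PostId="562" Comment="insert duplicate link" UserId="-1" />
  <row Id="1585" PostHistoryTypeId="10" PostId="562" Comment="1" />
  <row Id="2000" PostHistoryTypeId="10" PostId="700" Comment="1" />
</posthistory>"""

def pair_duplicate_events(root):
    """Map each PostId to the Ids of its boilerplate edits and duplicate closes."""
    by_post = defaultdict(lambda: {"boilerplate": [], "closed": []})
    for row in root.iter("row"):
        type_id = row.get("PostHistoryTypeId")
        comment = row.get("Comment")
        if type_id == "5" and comment == "insert duplicate link":
            by_post[row.get("PostId")]["boilerplate"].append(row.get("Id"))
        elif type_id == "10" and comment in ("1", "101"):
            by_post[row.get("PostId")]["closed"].append(row.get("Id"))
    return dict(by_post)

pairs = pair_duplicate_events(ET.fromstring(SAMPLE))
# Posts closed as duplicates that never received the boilerplate edit.
closed_without_boilerplate = sorted(
    post_id for post_id, events in pairs.items()
    if events["closed"] and not events["boilerplate"]
)
```

Run against the real `root`, a non-empty `closed_without_boilerplate` list would suggest the boilerplate edit is not always paired with a duplicate close.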