mkdir xml/
BUGZILLA_LOGIN=~~~~ BUGZILLA_LOGINCOOKIE=~~~~~~~~~~ ./bugzilla-to-xml.py
This will take about 36 hours to fetch 53000 bugs, totaling 2.9GB of disk space.
Bugzilla will accept our requests and serve partial data
even if we don't present a valid BUGZILLA_LOGINCOOKIE.
However, in that case it strips the domain part from every
email address (see this SO question).
If you want full email addresses (which you do), you must
provide your login cookies.
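For reference, here's roughly the request the script makes for each bug. This is a hand-written sketch, assuming LLVM's Bugzilla host and stock Bugzilla's cookie names (Bugzilla_login, Bugzilla_logincookie); it's not necessarily the exact code in bugzilla-to-xml.py:

curl -s -b "Bugzilla_login=$BUGZILLA_LOGIN; Bugzilla_logincookie=$BUGZILLA_LOGINCOOKIE" 'https://bugs.llvm.org/show_bug.cgi?ctype=xml&id=12345' > xml/12345.xml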
This step is repeatable, but it takes a very long time.
You can parallelize it by running copies of the script with
different ranges of (FIRST_BUGZILLA_NUMBER, LAST_BUGZILLA_NUMBER),
all writing into the same xml/ directory.
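For example, two copies each covering half the range (this sketch assumes the script reads its range from those two environment variables; if they're constants inside the script, edit your copies instead):

FIRST_BUGZILLA_NUMBER=1 LAST_BUGZILLA_NUMBER=26500 BUGZILLA_LOGIN=~~~~ BUGZILLA_LOGINCOOKIE=~~~~~~~~~~ ./bugzilla-to-xml.py &
FIRST_BUGZILLA_NUMBER=26501 LAST_BUGZILLA_NUMBER=53000 BUGZILLA_LOGIN=~~~~ BUGZILLA_LOGINCOOKIE=~~~~~~~~~~ ./bugzilla-to-xml.py &
wait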
Some bug numbers don't exist in Bugzilla. We can detect these from the XML itself (a real bug's XML always contains a short_desc element) and segregate them into a different directory.
mkdir invalid-xml/
for i in $(grep -rL short_desc xml/) ; do mv "$i" invalid-xml/ ; done
This will take about two minutes, and reduce the
number of files in the xml/ directory to 51567.
You can inspect the files in invalid-xml/ to see
what's being thrown out.
./xml-to-json.py
This will take about seven minutes to produce 51567 JSON files, totaling 349MB of disk space.
The resulting JSON files use a schema that's roughly the same as the one you get by exporting from GitHub's Issues API. Notice that this is not quite the same as the schema supported by GitHub's undocumented Issues Import API.
As noted above, this step can't just pipe our JSON files directly into GitHub's undocumented Issues Import API, because the schemas don't quite match; it has to do some extra massaging. Also, unless you have magic GitHub superpowers, some of the data must necessarily be thrown away: for example, our JSON records the author of each comment, but we can't preserve that authorship on import, since we (rightly) have no power to impersonate random GitHub users and submit comments under their names.
In fact, we must provide a GitHub "Personal API Token" (which you can generate here if you're logged into GitHub right now); otherwise GitHub won't accept our API requests at all. Every issue and comment created by this script will be attributed to the account whose API token you use.
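For the curious, each import request looks roughly like this. This is a sketch based on the community's write-ups of the "golden-comet" preview (OWNER, REPO, and the payload values are placeholders); since the API is undocumented, details may change:

curl -X POST -H "Authorization: token $GITHUB_API_TOKEN" -H "Accept: application/vnd.github.golden-comet-preview+json" -d '{"issue": {"title": "Example bug title", "body": "Originally reported by someone@example.com", "created_at": "2010-01-01T00:00:00Z", "closed": true}, "comments": [{"created_at": "2010-01-02T00:00:00Z", "body": "Comment originally by someone-else@example.com"}]}' https://api.github.com/repos/OWNER/REPO/import/issues

Notice that the original authors survive only as text inside the body fields; the API offers no way to attribute an issue or comment to a different user.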
GITHUB_API_TOKEN=~~~ ./json-to-github.py
This last step is irreversible! It increments the issue numbers on the GitHub repo you point it at. There is no way to "decrement and try again," except to delete the entire repo and re-create it.
This will take about 17 hours to upload 51567 issues to GitHub.
For my blog post with more information about this repo, see
- "I sweded the LLVM Bugzilla migration" (2021-12-11)
For other GitHub Issues API scripts, see