(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s
vlofgren committed Sep 20, 2023
1 parent 3b4d08f commit 5b0a6d7
Showing 7 changed files with 141 additions and 2 deletions.
19 changes: 19 additions & 0 deletions code/features-convert/stackexchange-xml/readme.md
@@ -0,0 +1,19 @@
Stackexchange's data is a jumble of questions and answers,
where each answer refers to its question through a `ParentId` attribute.

e.g.
```xml
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="51" CreationDate="2016-01-12T18:45:19.963" Score="10" ViewCount="424" Body="&lt;p&gt;When I've printed an object I've had to choose between high resolution and quick prints. What techniques or technologies can I use or deploy to speed up my high resolution prints?&lt;/p&gt;&#xA;" OwnerUserId="16" LastActivityDate="2017-10-31T02:31:08.560" Title="How to obtain high resolution prints in a shorter period of time?" Tags="&lt;resolution&gt;&lt;speed&gt;&lt;quality&gt;" AnswerCount="2" CommentCount="6" ContentLicense="CC BY-SA 3.0" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="12" CreationDate="2016-01-12T18:45:51.287" Score="34" ViewCount="7377" Body="&lt;p&gt;I would like to buy a 3D printer, but I'm concerned about the health risks that are associated with its operation. Some groups of scientists say it can be &lt;a href=&quot;http://www.techworld.com/news/personal-tech/scientists-warn-of-3d-printing-health-effects-as-tech-hits-high-street-3460992/&quot;&gt;harmful&lt;/a&gt; for humans.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;What do I need to consider before buying a 3D printer if I care about my health? Are there any safe printers?&lt;/p&gt;&#xA;" OwnerUserId="20" LastEditorUserId="334" LastEditDate="2016-11-15T16:16:11.163" LastActivityDate="2019-06-10T23:18:34.190" Title="Is 3D printing safe for your health?" Tags="&lt;print-material&gt;&lt;safety&gt;&lt;health&gt;" AnswerCount="4" CommentCount="1" ContentLicense="CC BY-SA 3.0" />
<row Id="12" PostTypeId="2" ParentId="2" CreationDate="2016-01-12T19:13:00.710" Score="23" Body="&lt;p&gt;There is very little information about safety available, as home 3D printers are relatively new. However, plastics such as ABS have a long history in making plastic products, and a study found..." />
</posts>
```
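
As a rough illustration of the format, the `row` elements can be read with the JDK's StAX parser. This is a minimal sketch and not necessarily how the converter reads the dump: the class `PostRowReaderSketch` and the record `PostRow` are hypothetical names, and only the `Id`, `PostTypeId` and `ParentId` attributes are extracted.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class PostRowReaderSketch {
    /** Hypothetical holder for the attributes of interest; parentId is null for questions. */
    record PostRow(int id, int postTypeId, Integer parentId) {}

    /** Streams through the XML and collects one PostRow per <row> element. */
    static List<PostRow> readRows(String xml) {
        List<PostRow> rows = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    // ParentId is only present on answers
                    String parent = reader.getAttributeValue(null, "ParentId");
                    rows.add(new PostRow(
                            Integer.parseInt(reader.getAttributeValue(null, "Id")),
                            Integer.parseInt(reader.getAttributeValue(null, "PostTypeId")),
                            parent == null ? null : Integer.valueOf(parent)));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed posts XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = """
                <posts>
                  <row Id="2" PostTypeId="1" AcceptedAnswerId="12" />
                  <row Id="12" PostTypeId="2" ParentId="2" />
                </posts>
                """;
        readRows(xml).forEach(System.out::println);
    }
}
```

A streaming parser is used here because the dumps are large; a DOM parser would have to load the whole file at once.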

The search engine wants to extract keywords for each thread as a whole,
rather than for individual questions or answers, so the (very large)
data set has to be re-arranged by thread. SQLite does a decent job of
enabling this task.
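
The re-arrangement boils down to giving every post a thread ID. Below is a minimal in-memory sketch, assuming that a question starts its own thread and an answer joins the thread of its `ParentId`. The names `ThreadGroupingSketch`, `Post` and `threadIdOf` are hypothetical, and a real dump may well be too large to group in memory like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ThreadGroupingSketch {
    /** Hypothetical post model; parentId is null for questions. */
    record Post(int id, Integer parentId, String body) {}

    /** Assumption: a question is its own thread, an answer belongs to its parent's thread. */
    static int threadIdOf(Post post) {
        return post.parentId() != null ? post.parentId() : post.id();
    }

    /** Groups posts by thread ID, keeping the original order within each thread. */
    static Map<Integer, List<Post>> groupByThread(List<Post> posts) {
        Map<Integer, List<Post>> threads = new TreeMap<>();
        for (Post post : posts) {
            threads.computeIfAbsent(threadIdOf(post), id -> new ArrayList<>()).add(post);
        }
        return threads;
    }

    public static void main(String[] args) {
        List<Post> posts = List.of(
                new Post(1, null, "How to obtain high resolution prints?"),
                new Post(2, null, "Is 3D printing safe for your health?"),
                new Post(12, 2, "There is very little information about safety available"));

        groupByThread(posts).forEach((threadId, thread) ->
                System.out.println("thread " + threadId + ": " + thread.size() + " post(s)"));
    }
}
```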

See [tools/stackexchange-converter](../../tools/stackexchange-converter).
@@ -33,8 +33,9 @@ public class StackExchangePostsDb {

/** Construct a SQLite file containing the Posts in the stack exchange-style 7z file */
@SneakyThrows
-public static void create(Path sqliteFile,
-                          Path stackExchange7zFile) {
+public static void create(String domain,
+                          Path sqliteFile,
+                          Path stackExchange7zFile) {
if (Files.exists(sqliteFile))
Files.delete(sqliteFile);
String connStr = "jdbc:sqlite:" + sqliteFile;
@@ -58,6 +59,13 @@ public static void create(Path sqliteFile,
stackExchange7zFile
);

+var insertMeta = connection.prepareStatement("""
+    INSERT INTO metadata(domainName)
+    VALUES (?)
+    """);
+insertMeta.setString(1, domain);
+insertMeta.executeUpdate();
+
var insertPost = connection.prepareStatement("""
INSERT INTO post(id, threadId, postYear, title, body, origSize, tags)
VALUES (?, ?, ?, ?, ?, ?, ?)
@@ -8,4 +8,8 @@ CREATE TABLE post (
tags TEXT
);

+CREATE TABLE metadata (
+    domainName TEXT
+);
+
CREATE INDEX post_threadId ON post(threadId);
52 changes: 52 additions & 0 deletions code/tools/stackexchange-converter/build.gradle
@@ -0,0 +1,52 @@
plugins {
    id 'java'
    id "io.freefair.lombok" version "8.2.2"
    id 'application'

    id 'jvm-test-suite'
}

java {
    toolchain {
        languageVersion.set(JavaLanguageVersion.of(20))
    }
}

application {
    mainClass = 'nu.marginalia.tools.StackexchangeConverter'
    applicationName = 'stackexchange-converter'
}

tasks.distZip.enabled = false

dependencies {
    implementation project(':code:features-convert:stackexchange-xml')

    implementation libs.lombok
    annotationProcessor libs.lombok
    implementation libs.bundles.slf4j
    implementation libs.notnull

    implementation libs.guice
    implementation libs.jsoup
    implementation libs.trove
    implementation libs.fastutil

    implementation libs.bundles.nlp
    implementation libs.commons.lang3

    testImplementation libs.bundles.slf4j.test
    testImplementation libs.bundles.junit
    testImplementation libs.mockito
}


test {
    useJUnitPlatform()
}

task fastTests(type: Test) {
    useJUnitPlatform {
        excludeTags "slow"
    }
}
24 changes: 24 additions & 0 deletions code/tools/stackexchange-converter/readme.md
@@ -0,0 +1,24 @@
This tool converts from stackexchange's 7z-compressed XML
format to a sqlite database that is digestible by the search engine.

See [features-convert/stackexchange-xml](../../features-convert/stackexchange-xml) for
an explanation of why this is necessary.

Stackexchange's data dumps can be downloaded from archive.org
here: [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)

<b>Usage</b>

```shell
$ stackexchange-converter domain-name input.7z output.db
```

Stackexchange is relatively conservative about allowing
new questions, so this is a job that rarely needs to run more than once.

<b>Note</b>: Reading and writing these db files is *absurdly* slow
on a mechanical hard-drive.

## See Also

* [features-convert/stackexchange-xml](../../features-convert/stackexchange-xml)
@@ -0,0 +1,31 @@
package nu.marginalia.tools;

import nu.marginalia.integration.stackexchange.sqlite.StackExchangePostsDb;

import java.nio.file.Files;
import java.nio.file.Path;

public class StackexchangeConverter {
    public static void main(String[] args) {

        if (args.length != 3) {
            System.err.println("Converts a stackexchange Posts 7z file to a Marginalia-digestible sqlite-db\n");
            System.err.println("Arguments: domain-name input-file.7z output-file.db");
            return;
        }

        String domain = args[0];

        Path inputFile = Path.of(args[1]);
        Path outputFile = Path.of(args[2]);

        // Bail out early instead of handing a missing file to the converter
        if (!Files.exists(inputFile)) {
            System.err.println("Input file " + inputFile + " does not exist");
            return;
        }

        System.out.println("Converting " + inputFile);

        StackExchangePostsDb.create(domain, outputFile, inputFile);

        System.out.println("... done!");
    }
}
1 change: 1 addition & 0 deletions settings.gradle
@@ -75,6 +75,7 @@ include 'code:tools:experiment-runner'
include 'code:tools:website-adjacencies-calculator'
include 'code:tools:screenshot-capture-tool'
include 'code:tools:load-test'
+include 'code:tools:stackexchange-converter'

include 'third-party:porterstemmer'
include 'third-party:xz'