SDK Convert Corpus

André Santos edited this page Nov 20, 2016 · 2 revisions

A corpus can also be converted from one format to another programmatically, using the SDK.

The following source code snippet shows how to convert a corpus by creating a processing pipeline and using the data provided in the "example" folder.

// Set files
String documentsDirectory = "example/annotate/a1/in/";
String outputDirectory = "example/annotate/out/";

// Set input and output formats (needs java.util.List and java.util.ArrayList)
InputFormat inputFormat = InputFormat.A1;
List<OutputFormat> outputFormats = new ArrayList<>();
outputFormats.add(OutputFormat.CONLL);

// Create context
ContextConfiguration config = new ContextConfiguration.Builder()
        .withInputFormat(inputFormat)
        .withOutputFormats(outputFormats)
        .withParserTool(ParserTool.GDEP)
        .withParserLanguage(ParserLanguage.ENGLISH)
        .withParserLevel(ParserLevel.CHUNKING)
        .build();

Context context = new Context(config, null, null);

// Create batch executor
boolean compressed = false;
int numThreads = 1;
BatchExecutor batch = new FileBatchExecutor(documentsDirectory, outputDirectory, compressed, numThreads, false, true);

// Run batch processing
batch.run(FileProcessor.class, context);
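
Once the batch run has finished, it can be handy to confirm that converted files were actually written. The following is a minimal sketch that uses only the standard `java.nio.file` API and does not depend on the SDK. The class name `OutputLister` and the method `listOutputFiles` are hypothetical, and it assumes that the converted files are written directly into the output directory used above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the SDK.
public class OutputLister {

    // Returns the sorted names of the regular files found directly in dir,
    // or an empty list if dir does not exist or is not a directory.
    static List<String> listOutputFiles(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: same output directory as in the conversion snippet.
        Path out = Paths.get(args.length > 0 ? args[0] : "example/annotate/out/");
        List<String> names = listOutputFiles(out);
        if (names.isEmpty()) {
            System.out.println("No output files found in " + out);
        }
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

Run it after the conversion, optionally passing a different output directory as the first argument.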