SDK Train

Sergio Matos edited this page Apr 13, 2021 · 3 revisions

If you do not want to use the Train CLI tool, you can also train a machine-learning model for NER programmatically. The process has two phases:

  • Phase 1: read sentences and annotations and build the corpus with NLP data;
  • Phase 2: train the model based on the built corpus.

Once training finishes, the model can be serialized to a file or used in a processing pipeline.

The following source code snippet shows how to train a machine-learning model for NER, using the data provided in the "example" folder.

// Set files
String sentencesFile = "example/train/sentences";
String annotationsFile = "example/train/annotations";
String modelConfigurationFile = "example/train/model.config";
String modelFile = "example/train/model.gz";

// Create parser
Parser parser = new GDepParser(ParserLanguage.ENGLISH, ParserLevel.CHUNKING, new LingpipeSentenceSplitter(), false).launch();

// Set sentences and annotations streams
InputStream sentencesStream = new FileInputStream(sentencesFile);
InputStream annotationsStream = new FileInputStream(annotationsFile);

// Run pipeline to get corpus from sentences and annotations
Pipeline pipelinePhase1 = new TrainPipelinePhase1()
        .add(new BC2Reader(parser, null, annotationsStream))
        .add(new TrainNLP(parser));
pipelinePhase1.run(sentencesStream);

// Close sentences and annotations streams
sentencesStream.close();
annotationsStream.close();

// Get corpus
Corpus corpus = pipelinePhase1.getCorpus();

// Get model configuration
ModelConfig modelConfig = new ModelConfig(modelConfigurationFile);

// Placeholder input: phase 2 trains on the corpus set below, not on this stream
InputStream inputStream = new ByteArrayInputStream(" ".getBytes("UTF-8"));

// Run pipeline to train model on corpus
Pipeline pipelinePhase2 = new TrainPipelinePhase2()
        .add(new DefaultTrainer(modelConfig));
pipelinePhase2.setCorpus(corpus);
pipelinePhase2.run(inputStream);

// Close input stream
inputStream.close();

// Get trained model and write to file
CRFModel model = (CRFModel) pipelinePhase2.getModuleData("TRAINED_MODEL").get(0);
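// try-with-resources makes sure the gzip stream is finished and closed after writing
try (GZIPOutputStream modelStream = new GZIPOutputStream(new FileOutputStream(modelFile))) {
    model.write(modelStream);
}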
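
After training, it can be useful to confirm that the file written to "example/train/model.gz" is a complete, readable gzip stream. The sketch below uses only the Java standard library; the class name `ModelFileCheck` and the method `decompressedSize` are hypothetical helpers and not part of the SDK. It checks only that the file decompresses cleanly, not that its contents are a valid model.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper (not part of the SDK): sanity-checks a gzip-compressed file.
public class ModelFileCheck {

    // Decompresses the whole file and returns the number of decompressed bytes.
    // Throws an IOException if the file is not gzip data or is truncated.
    static long decompressedSize(Path file) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the model path used in the snippet above.
        Path file = Paths.get(args.length > 0 ? args[0] : "example/train/model.gz");
        System.out.println(file + ": " + decompressedSize(file) + " bytes after decompression");
    }
}
```

If the check throws an exception, the output stream was probably not closed before the program exited, or the file was written to a different path.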