Commit 9275280

Grace Muzny (Stanford NLP) authored and committed
merge master
1 parent f0b13fc commit 9275280

File tree

170 files changed: +85250 -86078 lines


README.md

Lines changed: 21 additions & 7 deletions

@@ -5,25 +5,39 @@ Stanford CoreNLP provides a set of natural language analysis tools written in Ja
 
 The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses, but not its use in proprietary software that you distribute to others.
 
-#### How To Compile (with ant)
+#### Build Instructions
 
-1. cd CoreNLP ; ant
+Several times a year we distribute a new version of the software, which corresponds to a stable commit.
 
-#### How To Create A Jar
+During the time between releases, one can always use the latest, under development version of our code.
 
-1. compile the code
-2. cd CoreNLP/classes ; jar -cf ../stanford-corenlp.jar edu
+Here are some helpful instructions to use the latest code:
+
+1. Make sure you have ant installed.
+2. Compile the code with this command: `cd CoreNLP ; ant`
+3. Then run this command to build a jar with the latest version of the code: `cd CoreNLP/classes ; jar -cf ../stanford-corenlp.jar edu`
+4. This will create a new jar called stanford-corenlp.jar in the CoreNLP folder which contains the latest code.
+5. The dependencies that work with the latest code are in CoreNLP/lib and CoreNLP/liblocal, so make sure to include those in your CLASSPATH.
+6. Also make sure to download the latest versions of the [corenlp-models](http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar)
+and [english-models](http://nlp.stanford.edu/software/stanford-english-corenlp-models-current.jar), and include them in your CLASSPATH. If you
+are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.
 
 You can find releases of Stanford CoreNLP on [Maven Central](http://search.maven.org/#browse%7C11864822).
 
 You can find more explanation and documentation on [the Stanford CoreNLP homepage](http://nlp.stanford.edu/software/corenlp.shtml#Demo).
 
 The most recent models associated with the code in the HEAD of this repository can be found [here](http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar).
 
-Some of the larger (English) models -- like the shift-reduce parser and WikiDict -- are not distributed with our default models jar.
+Some of the larger (English) models -- like the shift-reduce parser and WikiDict -- are not distributed with our default models jar.
 The most recent version of these models can be found [here](http://nlp.stanford.edu/software/stanford-english-corenlp-models-current.jar).
 
+We distribute resources for other languages as well, including [Arabic models](http://nlp.stanford.edu/software/stanford-arabic-corenlp-models-current.jar),
+[Chinese models](http://nlp.stanford.edu/software/stanford-chinese-corenlp-models-current.jar),
+[French models](http://nlp.stanford.edu/software/stanford-french-corenlp-models-current.jar),
+[German models](http://nlp.stanford.edu/software/stanford-german-corenlp-models-current.jar),
+and [Spanish models](http://nlp.stanford.edu/software/stanford-spanish-corenlp-models-current.jar).
+
 For information about making contributions to Stanford CoreNLP, see the file [CONTRIBUTING.md](CONTRIBUTING.md).
 
-Questions about CoreNLP can either be posted on StackOverflow with the tag [stanford-nlp](http://stackoverflow.com/questions/tagged/stanford-nlp),
+Questions about CoreNLP can either be posted on StackOverflow with the tag [stanford-nlp](http://stackoverflow.com/questions/tagged/stanford-nlp),
 or on the [mailing lists](http://nlp.stanford.edu/software/corenlp.shtml#Mail).
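The CLASSPATH assembly in steps 5 and 6 of the new instructions is where most setup mistakes happen. As a quick illustration (not part of the repo; `corenlp_classpath` and the exact layout it assumes are hypothetical), one way to collect the freshly built jar plus every dependency jar under lib/ and liblocal/ is:

```python
import os
from pathlib import Path

def corenlp_classpath(repo="CoreNLP"):
    """Build a CLASSPATH string covering the jar built in step 3 plus
    every dependency jar under lib/ and liblocal/ (step 5)."""
    repo = Path(repo)
    jars = [repo / "stanford-corenlp.jar"]  # built in step 3
    for d in ("lib", "liblocal"):
        jars.extend(sorted((repo / d).glob("*.jar")))
    # join with ':' on Unix, ';' on Windows
    return os.pathsep.join(str(j) for j in jars)
```

The models jars from step 6 would be appended to the same string in exactly the same way.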

build.xml

Lines changed: 1 addition & 1 deletion

@@ -160,7 +160,7 @@
 <target name="itest" depends="classpath,compile"
 description="Run core integration tests">
 <echo message="${ant.project.name}" />
-<junit fork="yes" maxmemory="8g" printsummary="off" outputtoformatters="false" forkmode="perTest" haltonfailure="true">
+<junit fork="yes" maxmemory="10g" printsummary="off" outputtoformatters="false" forkmode="perTest" haltonfailure="true">
 <classpath refid="classpath"/>
 <classpath path="${build.path}"/>
 <classpath path="${data.path}"/>

data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon

Lines changed: 5 additions & 0 deletions

@@ -98,6 +98,11 @@ NN=target <... {/\\%/}
 
 relabel target SYM
 
+% fused det-noun pronouns -> PRON
+NN=target < (/^(?i:(somebody|something|someone|anybody|anything|anyone|everybody|everything|everyone|nobody|nothing))$/)
+
+relabel target PRON
+
 % NN -> NOUN (otherwise)
 NN=target <... {/.*/}
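The new Tsurgeon rule keys on a single case-insensitive alternation over fused det-noun pronouns. A quick sanity check of that alternation in plain Python `re` (outside Tsurgeon, with `re.IGNORECASE` standing in for the `(?i:...)` inline flag; `upos_for_nn` is just an illustrative stand-in for the two NN rules):

```python
import re

# The word list from the new fused det-noun pronoun rule, verbatim
FUSED_PRONOUNS = re.compile(
    r"^(somebody|something|someone|anybody|anything|anyone|"
    r"everybody|everything|everyone|nobody|nothing)$",
    re.IGNORECASE,
)

def upos_for_nn(word):
    """Mirror the two NN rules: fused pronouns relabel to PRON,
    anything else falls through to the catch-all NOUN rule."""
    return "PRON" if FUSED_PRONOUNS.match(word) else "NOUN"
```

Note the anchors: without `^...$`, a token like "anybodys" would also match via the "anybody" alternative.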

doc/corenlp/README.txt

Lines changed: 3 additions & 0 deletions

@@ -42,6 +42,9 @@ LICENSE
 CHANGES
 ---------------------------------
 
+2016-10-30    3.7.0     KBP Annotator, improved coreference, Arabic
+                        pipeline
+
 2015-12-09    3.6.0     Improved coreference, OpenIE integration,
                         Stanford CoreNLP server

doc/corenlp/pom-full.xml

Lines changed: 4 additions & 4 deletions

@@ -2,7 +2,7 @@
 <modelVersion>4.0.0</modelVersion>
 <groupId>edu.stanford.nlp</groupId>
 <artifactId>stanford-corenlp</artifactId>
-<version>3.6.0</version>
+<version>3.7.0</version>
 <packaging>jar</packaging>
 <name>Stanford CoreNLP</name>
 <description>Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. It provides the foundational building blocks for higher level text understanding applications.</description>
@@ -14,8 +14,8 @@
 </license>
 </licenses>
 <scm>
-<url>http://nlp.stanford.edu/software/stanford-corenlp-2015-12-06.zip</url>
-<connection>http://nlp.stanford.edu/software/stanford-corenlp-2015-12-06.zip</connection>
+<url>http://nlp.stanford.edu/software/stanford-corenlp-2016-10-30.zip</url>
+<connection>http://nlp.stanford.edu/software/stanford-corenlp-2016-10-30.zip</connection>
 </scm>
 <developers>
 <developer>
@@ -88,7 +88,7 @@
 <configuration>
 <artifacts>
 <artifact>
-<file>${project.basedir}/stanford-corenlp-3.6.0-models.jar</file>
+<file>${project.basedir}/stanford-corenlp-3.7.0-models.jar</file>
 <type>jar</type>
 <classifier>models</classifier>
 </artifact>

doc/tagger/README-Models.txt

Lines changed: 1 addition & 5 deletions

@@ -105,15 +105,11 @@ University of Stuttgart and the Seminar für Sprachwissenschaft of the
 University of Tübingen. See:
 http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
 This model uses features from the distributional similarity clusters
-built over the HGC.
+built over the HGC (Huge German Corpus).
 Performance:
 96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
 (90.33% on unknown words)
 
-german-dewac.tagger
-This model uses features from the distributional similarity clusters
-built from the deWac web corpus.
-
 german-fast.tagger
 Lacks distributional similarity features, but is several times faster
 than the other alternatives.

itest/src/edu/stanford/nlp/coref/hybrid/ChineseCorefBenchmarkSlowITest.java

Lines changed: 14 additions & 14 deletions

@@ -47,9 +47,9 @@ private static String runCorefTest(boolean deleteOnExit) throws Exception {
 String currentDir = System.getProperty("user.dir");
 System.err.println("Current dir using System:" +currentDir);
 
-String[] corefArgs = { "-props", "edu/stanford/nlp/coref/hybrid/properties/zh-conll.properties",
-'-' + CorefProperties.LOG_PROP, baseLogFile,
-'-' + CorefProperties.PATH_OUTPUT_PROP, WORK_DIR_FILE.toString()+File.separator };
+String[] corefArgs = { "-props", "edu/stanford/nlp/coref/hybrid/properties/zh-dcoref-conll.properties",
+'-' + HybridCorefProperties.LOG_PROP, baseLogFile,
+'-' + CorefProperties.OUTPUT_PATH_PROP, WORK_DIR_FILE.toString()+File.separator };
 
 Properties props = StringUtils.argsToProperties(corefArgs);
 System.err.println("Running coref with arguments:");
@@ -107,24 +107,24 @@ public void testChineseDcoref() throws Exception {
 Counter<String> highResults = new ClassicCounter<String>();
 Counter<String> expectedResults = new ClassicCounter<String>();
 
-setLowHighExpected(lowResults, highResults, expectedResults, MENTION_TP, 12550, 12700, 12600); // In 2015 was: 12370
+setLowHighExpected(lowResults, highResults, expectedResults, MENTION_TP, 12550, 12700, 12596); // In 2015 was: 12370
 setLowHighExpected(lowResults, highResults, expectedResults, MENTION_F1, 55.7, 56.0, 55.88); // In 2015 was: 55.59
 
-setLowHighExpected(lowResults, highResults, expectedResults, MUC_TP, 6050, 6100, 6063); // In 2015 was: 5958
-setLowHighExpected(lowResults, highResults, expectedResults, MUC_F1, 58.30, 58.80, 58.48); // In 2015 was: 57.87
+setLowHighExpected(lowResults, highResults, expectedResults, MUC_TP, 6050, 6100, 6065); // In 2015 was: 5958
+setLowHighExpected(lowResults, highResults, expectedResults, MUC_F1, 58.30, 58.80, 58.52); // In 2015 was: 57.87
 
-setLowHighExpected(lowResults, highResults, expectedResults, BCUBED_TP, 6990, 7110.00, 7100.92); // In 2015 was: 6936.32
-setLowHighExpected(lowResults, highResults, expectedResults, BCUBED_F1, 51.60, 52.00, 51.86); // In 2015 was: 51.07
+setLowHighExpected(lowResults, highResults, expectedResults, BCUBED_TP, 6990, 7110.00, 7026.39); // In 2015 was: 6936.32
+setLowHighExpected(lowResults, highResults, expectedResults, BCUBED_F1, 51.60, 52.20, 52.11); // In 2015 was: 51.07
 
-setLowHighExpected(lowResults, highResults, expectedResults, CEAFM_TP, 8220, 8260, 8242); // In 2015 was: 8074
-setLowHighExpected(lowResults, highResults, expectedResults, CEAFM_F1, 55.50, 56.00, 55.77); // In 2015 was: 55.10
+setLowHighExpected(lowResults, highResults, expectedResults, CEAFM_TP, 8220, 8260, 8224); // In 2015 was: 8074
+setLowHighExpected(lowResults, highResults, expectedResults, CEAFM_F1, 55.40, 56.00, 55.43); // In 2015 was: 55.10
 
-setLowHighExpected(lowResults, highResults, expectedResults, CEAFE_TP, 2250.00, 2300.00, 2272.52); // In 2015 was: 2205.72
-setLowHighExpected(lowResults, highResults, expectedResults, CEAFE_F1, 51.50, 52.00, 51.52); // In 2015 was: 50.62
+setLowHighExpected(lowResults, highResults, expectedResults, CEAFE_TP, 2250.00, 2300.00, 2296.06); // In 2015 was: 2205.72
+setLowHighExpected(lowResults, highResults, expectedResults, CEAFE_F1, 51.30, 52.00, 51.33); // In 2015 was: 50.62
 
-setLowHighExpected(lowResults, highResults, expectedResults, BLANC_F1, 46.75, 47.25, 47.00); // In 2015 was: 46.19
+setLowHighExpected(lowResults, highResults, expectedResults, BLANC_F1, 46.00, 47.25, 46.68); // In 2015 was: 46.19
 
-setLowHighExpected(lowResults, highResults, expectedResults, CONLL_SCORE, 53.75, 54.00, 53.95); // In 2015 was: 53.19
+setLowHighExpected(lowResults, highResults, expectedResults, CONLL_SCORE, 53.75, 54.10, 54.01); // In 2015 was: 53.19
 
 BenchmarkingHelper.benchmarkResults(results, lowResults, highResults, expectedResults);
 }
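Every updated line in this test follows the same low/high/expected pattern, so the acceptance logic reduces to a per-metric range check. A minimal sketch of that idea (hypothetical `benchmark_results`, standing in for the setLowHighExpected triples plus BenchmarkingHelper.benchmarkResults; here the `expected` values are treated as reference points only):

```python
def benchmark_results(actual, low, high):
    """Fail the benchmark if any observed metric leaves its
    [low, high] acceptance window."""
    out_of_range = {
        metric: value
        for metric, value in actual.items()
        if not (low[metric] <= value <= high[metric])
    }
    if out_of_range:
        raise AssertionError(f"metrics out of range: {out_of_range}")
```

For example, the updated CONLL_SCORE window accepts an observed 54.01 because 53.75 <= 54.01 <= 54.10, while a MUC_F1 of 59.0 would fail against its 58.30..58.80 window.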

itest/src/edu/stanford/nlp/ie/crf/TestThreadedCRFClassifier.java

Lines changed: 22 additions & 20 deletions

@@ -12,6 +12,7 @@
 import edu.stanford.nlp.util.Timing;
 
 public class TestThreadedCRFClassifier {
+
 TestThreadedCRFClassifier(Properties props) {
 inputEncoding = props.getProperty("inputEncoding", "UTF-8");
 }
@@ -23,8 +24,8 @@ public class TestThreadedCRFClassifier {
 
 private final String inputEncoding;
 
-CRFClassifier loadClassifier(String loadPath, Properties props) {
-CRFClassifier crf = new CRFClassifier(props);
+static CRFClassifier loadClassifier(String loadPath, Properties props) {
+CRFClassifier crf = new CRFClassifier(props);
 crf.loadClassifierNoExceptions(loadPath, props);
 return crf;
 }
@@ -58,9 +59,9 @@ public void run() {
 Timing t = new Timing();
 resultsString = runClassifier(crf, filename);
 long millis = t.stop();
-System.out.println("Thread " + threadName + " took " + millis +
+System.out.println("Thread " + threadName + " took " + millis +
 "ms to tag file " + filename);
-}
+}
 }
 
 /**
@@ -71,7 +72,7 @@ public void run() {
 * -crf2 ../stanford-releases/stanford-ner-models/dewac_175m_600.ser.gz
 * -testFile ../data/german-ner/deu.testa -inputEncoding iso-8859-1
 */
-static public void main(String[] args) {
+public static void main(String[] args) {
 try {
 System.setOut(new PrintStream(System.out, true, "UTF-8"));
 System.setErr(new PrintStream(System.err, true, "UTF-8"));
@@ -81,10 +82,10 @@ static public void main(String[] args) {
 
 runTest(StringUtils.argsToProperties(args));
 }
-
+
 static public void runTest(Properties props) {
 TestThreadedCRFClassifier test = new TestThreadedCRFClassifier(props);
-test.runThreadedTest(props);
+test.runThreadedTest(props);
 }
 
 
@@ -95,7 +96,7 @@ void runThreadedTest(Properties props) {
 ArrayList<String> modelNames = new ArrayList<String>();
 ArrayList<CRFClassifier> classifiers = new ArrayList<CRFClassifier>();
 
-for (int i = 1;
+for (int i = 1;
 props.getProperty("crf" + Integer.toString(i)) != null; ++i) {
 String model = props.getProperty("crf" + Integer.toString(i));
 CRFClassifier crf = loadClassifier(model, props);
@@ -107,7 +108,7 @@ void runThreadedTest(Properties props) {
 // must run twice to account for "transductive learning"
 results = runClassifier(crf, testFile);
 baseResults.add(results);
-System.out.println("Stored base results for " + model +
+System.out.println("Stored base results for " + model +
 "; length " + results.length());
 }
 
@@ -121,13 +122,13 @@ void runThreadedTest(Properties props) {
 String repeated = runClassifier(crf, testFile);
 if (!base.equals(repeated)) {
 throw new RuntimeException("Repeated unthreaded results " +
-"not the same for " + model +
+"not the same for " + model +
 " run on file " + testFile);
 }
 }
 
 // test the first classifier in several simultaneous threads
-int numThreads = PropertiesUtils.getInt(props, "simThreads",
+int numThreads = PropertiesUtils.getInt(props, "simThreads",
 DEFAULT_SIM_THREADS);
 
 ArrayList<CRFThread> threads = new ArrayList<CRFThread>();
@@ -148,11 +149,11 @@ void runThreadedTest(Properties props) {
 System.out.println("Yay!");
 } else {
 throw new RuntimeException("Results not equal when running " +
-modelNames.get(0) + " under " +
+modelNames.get(0) + " under " +
 numThreads + " simultaneous threads");
 }
 }
-
+
 // test multiple classifiers (if given) in multiple threads each
 if (classifiers.size() > 1) {
 numThreads = PropertiesUtils.getInt(props, "multipleThreads",
@@ -162,11 +163,11 @@ void runThreadedTest(Properties props) {
 int classifierNum = i % classifiers.size();
 int repeatNum = i / classifiers.size();
 threads.add(new CRFThread(classifiers.get(classifierNum), testFile,
-("Simultaneous-" + classifierNum +
+("Simultaneous-" + classifierNum +
 "-" + repeatNum)));
 }
-for (int i = 0; i < threads.size(); ++i) {
-threads.get(i).start();
+for (CRFThread thread : threads) {
+thread.start();
 }
 for (int i = 0; i < threads.size(); ++i) {
 int classifierNum = i % classifiers.size();
@@ -182,16 +183,17 @@ void runThreadedTest(Properties props) {
 System.out.println("Yay!");
 } else {
 throw new RuntimeException("Results not equal when running " +
-modelNames.get(classifierNum) +
-" under " + numThreads +
+modelNames.get(classifierNum) +
+" under " + numThreads +
 " threads with " +
-classifiers.size() +
+classifiers.size() +
 " total classifiers");
 }
-}
+}
 }
 
 // if no exceptions thrown, great success
 System.out.println("Everything worked!");
 }
+
 }
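The invariant this test enforces is simple: the same classifier, run on the same file from several threads, must produce results identical to the single-threaded baseline. A stripped-down sketch of that check (hypothetical `assert_threads_agree`, with any pure function standing in for the CRF tagger):

```python
import threading

def assert_threads_agree(tag, text, num_threads=4):
    """Run `tag` concurrently on the same input and require every
    thread's result to equal the single-threaded baseline."""
    baseline = tag(text)
    results = [None] * num_threads

    def worker(i):
        results[i] = tag(text)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if any(r != baseline for r in results):
        raise RuntimeError("results not equal across threads")
```

The real test adds a second dimension, running several different classifiers in interleaved threads, but each comparison is still against that classifier's own unthreaded result.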

itest/src/edu/stanford/nlp/ie/crf/ThreadedCRFClassifierITest.java

Lines changed: 21 additions & 17 deletions

@@ -4,30 +4,33 @@
 
 import java.util.Properties;
 
-/**
+/**
 * Test that the CRFClassifier works when multiple classifiers are run
 * in multiple threads.
 *
 * @author John Bauer
 */
 public class ThreadedCRFClassifierITest extends TestCase {
+
 Properties props;
 
-private String german1 =
-"/u/nlp/data/ner/goodClassifiers/german.hgc_175m_600.crf.ser.gz";
-private String german2 =
+private static final String german1 =
+"edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz";
+/** -- We're no longer supporting this one
+private String german2 =
 "/u/nlp/data/ner/goodClassifiers/german.dewac_175m_600.crf.ser.gz";
-private String germanTestFile = "/u/nlp/data/german/ner/deu.testa";
+*/
+private static final String germanTestFile = "/u/nlp/data/german/ner/2016/deu.utf8.testa";
 
-private String english1 =
+private static final String english1 =
 "/u/nlp/data/ner/goodClassifiers/english.all.3class.nodistsim.crf.ser.gz";
-private String english2 =
-"/u/nlp/data/ner/goodClassifiers/english.all.3class.distsim.crf.ser.gz";
-private String englishTestFile = "/u/nlp/data/ner/column_data/conll.testa";
+private static final String english2 =
+"/u/nlp/data/ner/goodClassifiers/english.conll.4class.distsim.crf.ser.gz";
+private static final String englishTestFile = "/u/nlp/data/ner/column_data/conll.4class.testa";
+
+private static final String germanEncoding = "utf-8";
+private static final String englishEncoding = "utf-8";
 
-private String germanEncoding = "iso-8859-1";
-private String englishEncoding = "utf-8";
-
 @Override
 public void setUp() {
 props = new Properties();
@@ -47,12 +50,13 @@ public void testOneGermanCRF() {
 TestThreadedCRFClassifier.runTest(props);
 }
 
-public void testTwoGermanCRFs() {
-props.setProperty("crf1", german1);
-props.setProperty("crf2", german2);
-props.setProperty("testFile", germanTestFile);
-props.setProperty("inputEncoding", germanEncoding);
+public void testTwoEnglishCRFs() {
+props.setProperty("crf1", english1);
+props.setProperty("crf2", english2);
+props.setProperty("testFile", englishTestFile);
+props.setProperty("inputEncoding", englishEncoding);
 TestThreadedCRFClassifier.runTest(props);
 }
+
 }