More integration from Dipanjan, plus with annotations API too

commit 1e3cb7f9e09ca01ce4a480d53d8e2a2fee3ad982 1 parent f281fa5
Authored by @brendano
Showing with 568 additions and 129 deletions.
  1. +13 −0 .gitignore
  2. +227 −0 LICENSE
  3. +26 −0 README
  4. +0 −45 README.md
  5. BIN  lib_twitter/guava-r09.jar
  6. BIN  lib_twitter/lucene-core-3.0.3.jar
  7. BIN  lib_twitter/text-0.1.0.jar
  8. BIN  lib_twitter/twitter-text-1.1.8.jar
  9. +1 −1  runTagger.sh
  10. +8 −4 scripts/classwrap.sh
  11. +1 −1  scripts/compile.sh
  12. +11 −0 scripts/make_jar.sh
  13. +2 −1  src/edu/cmu/cs/lti/ark/ssl/pos/POSFeatureTemplates.java
  14. +53 −28 src/edu/cmu/cs/lti/ark/ssl/pos/SemiSupervisedPOSTagger.java
  15. +1 −1  src/edu/cmu/cs/lti/ark/ssl/pos/TagDictionary.java
  16. +4 −1 src/edu/cmu/cs/lti/ark/ssl/util/ProduceInterpolatedMultinomials.java
  17. +4 −1 src/edu/cmu/cs/lti/ark/ssl/util/ProduceInterpolatedMultinomialsTagDictionary.java
  18. +19 −30 src/edu/cmu/cs/lti/ark/tweetnlp/RunPOSTagger.java
  19. +87 −0 src/edu/cmu/cs/lti/ark/tweetnlp/TweetTaggerInstance.java
  20. +14 −10 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/CMUPOSAttribute.java
  21. +70 −0 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/CMUPOSAttributeImpl.java
  22. +15 −5 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/CMUPOSTagger.java
  23. +6 −1 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/MyTokenizerUsageExample.java
  24. +6 −0 src/edu/cmu/cs/lti/ark/tweetnlp/twokenize.scala
13 .gitignore
@@ -0,0 +1,13 @@
+.svn
+
+.idea
+*.iml
+out
+
+.settings
+.project
+.classpath
+.checkstyle
+
+mybuild
+tokenizer_development
227 LICENSE
@@ -0,0 +1,227 @@
+Everything is licensed under the Apache License version 2.0:
+http://www.apache.org/licenses/LICENSE-2.0
+
+edu.cmu.cs.lti.ark.ssl is Copyright 2011, Dipanjan Das.
+
+posBerkeley.jar is Copyright 2011, Taylor Berg-Kirkpatrick. Licensed as Apache 2.0:
+
+ From: Taylor Berg-Kirkpatrick
+ Date: Thu, 30 Jun 2011 15:33:03 -0700
+ Subject: Re: license for your code
+ To: Kevin Gimpel
+
+ Sure. I give you guys permission to release under Apache.
+
+ On Tue, Jun 28, 2011 at 11:50 AM, Kevin Gimpel <kgimpel@cs.cmu.edu> wrote:
+ [...]
+ > Dipanjan, Brendan, and I are working on the release of our group's Twitter
+ > part-of-speech tagger and we are now trying to figure out what license to
+ > use with the code that we are releasing. We extended code that you shared
+ > with Noah last summer
+ [...]
+ > But in order to do this, we'd have to ask you to "release" that same code
+ > you gave to us under an Apache license.
+
+=============================================================================
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
26 README
@@ -0,0 +1,26 @@
+CMU ARK Twitter Part-of-Speech Tagger
+http://www.ark.cs.cmu.edu/TweetNLP/
+
+Licensed under Apache 2.0 (see LICENSE file).
+
+Requires Java 6. To run the tagger:
+
+ ./runTagger.sh -input example_tweets.txt -output tagged_tweets.txt
+
+To build from source:
+ scripts/compile.sh
+
+To train and evaluate the tagger, see:
+ src/edu/cmu/cs/lti/ark/ssl/pos/SemiSupervisedPOSTagger.java
+
+Directories
+-----------
+ * runTagger.sh is the script you probably want
+ * lib/ has runtime dependencies
+ * lib_build/ has buildtime dependency
+ * lib_twitter/ has the twitter-text and annotations API
+ * scripts/ helps you build and run
+ * src/ has the actual source code (mostly java, and one bit of scala)
+
+The lib_twitter stuff is optional; if you want to use it, you may wish to
+update to a newer version of the library.
45 README.md
@@ -1,45 +0,0 @@
-THIS IS NOT DONE YET - if you run it, it will give JUNK tags
-
-http://www.ark.cs.cmu.edu/TweetNLP/
-
-
-To run the tagger
------------------
-
-The only dependency is Java 6. Run:
-
- ./runTagger.sh -input example_tweets.txt -output tagged_tweets.txt
-
-And, `./runTagger.sh -help` gives an overview of commandline options.
-
-
-To build from source
---------------------
-
- scripts/compile.sh
-
-
-To train and evaluate the tagger
--------------------------------
-
-(something very different, can be messy since no one will ever do it)
-
-
-Directories
------------
- * runTagger.sh is the script you probably want
-
- * lib/ has runtime dependencies
- * lib_build/ has buildtime dependency
- * scripts/ helps you build and run
- * src/ has the actual source code (mostly java, and one bit of scala)
-
-
-IDE notes
----------
-
-We include the Scala library and compiler to hopefully make a fresh build
-painless. We've also used Eclipse and IDEA too. You have to either install
-the appropriate Scala plugin, or just compile the Scala separately
-(`compile_scala.sh`) use the IDE's Java support for everything else.
-
BIN  lib_twitter/guava-r09.jar
Binary file not shown
BIN  lib_twitter/lucene-core-3.0.3.jar
Binary file not shown
BIN  lib_twitter/text-0.1.0.jar
Binary file not shown
BIN  lib_twitter/twitter-text-1.1.8.jar
Binary file not shown
2  runTagger.sh
@@ -1,3 +1,3 @@
#!/bin/bash
-$(dirname $0)/scripts/classwrap.sh edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger "$@"
+$(dirname $0)/scripts/classwrap.sh -Xmx2g edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger "$@"
12 scripts/classwrap.sh
@@ -4,13 +4,17 @@
set -eu
root=$(dirname $0)/..
-cp=$root/bin # Eclipse
-cp=$cp:$(print $root/out/production/*/ | tr ' ' :) # IDEA
-cp=$cp:$root/mybuild # Our own build dir
+
+cp=""
+# Eclipse and IDEA defaults
+cp=$cp:$root/bin
+cp=$cp:$(print $root/out/production/*/ | tr ' ' :)
+# our build dir
+cp=$cp:$root/mybuild
cp=$cp:$(print $root/lib/*.jar | tr ' ' :)
+# Twitter Commons text library stuff
cp=$cp:$(print $root/lib_twitter/*.jar | tr ' ' :)
-# set -x
exec java -cp "$cp" "$@"
2  scripts/compile.sh
@@ -16,7 +16,7 @@ mkdir -p mybuild
scripts/compile_scala.sh
-javac -cp mybuild:$(echo lib/*.jar|tr ' ' :) -d mybuild src/**/*.java
+javac -cp mybuild:$(echo {lib,lib_twitter}/*.jar | tr ' ' :) -d mybuild src/**/*.java
set +x
echo "All the .class files now in $(pwd)/mybuild"
11 scripts/make_jar.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+scripts/compile.sh
+(cd mybuild && jar cf ../lib/ark-tweet-nlp.jar *)
+
+exit
+
+# for release...
+# rm -rf mybuild
+# rm -rf lib_build
+# rm -rf .git
3  src/edu/cmu/cs/lti/ark/ssl/pos/POSFeatureTemplates.java
@@ -10,6 +10,7 @@
import java.util.List;
import java.util.Map;
import java.util.Set;
+import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.codec.language.Metaphone;
@@ -24,7 +25,7 @@
*/
public class POSFeatureTemplates {
- private static Logger log = Logger.getLogger(POSFeatureTemplates.class.getCanonicalName());
+ public static Logger log = Logger.getLogger(POSFeatureTemplates.class.getCanonicalName());
public interface EmitFeatureTemplate {
81 src/edu/cmu/cs/lti/ark/ssl/pos/SemiSupervisedPOSTagger.java
@@ -48,7 +48,7 @@
*/
private static final long serialVersionUID = 481162207516110632L;
- private static Logger log = Logger.getLogger(SemiSupervisedPOSTagger.class.getCanonicalName());
+ public static Logger log = Logger.getLogger(SemiSupervisedPOSTagger.class.getCanonicalName());
public static Random baseRand = new Random(43569);
public static Random[] rands;
@@ -89,7 +89,7 @@
private String testSet;
private String trainOrTest;
private String modelFile;
- private String runOutput;
+ private String runOutput = null;
private int numLabeledSentences;
private int numUnLabeledSentences;
private int maxSentenceLength;
@@ -177,8 +177,8 @@ public SemiSupervisedPOSTagger(POSOptions options0) {
parser = options.parser;
setVariousOptions();
createExecutionDirectory();
- }
-
+ }
+
private void createExecutionDirectory() {
long timeStamp = new Date().getTime();
File dir = new File(execPoolDir + "/" + timeStamp);
@@ -415,6 +415,11 @@ private void setVariousOptions() {
}
modelFile = (String) parser.getOptionValue(options.modelFile);
runOutput = (String) parser.getOptionValue(options.runOutput);
+ if (runOutput != null && !runOutput.equals("null")) {
+
+ } else {
+ runOutput = null;
+ }
if (!useOnlyUnlabeledData) {
numLabeledSentences = (Integer) parser.getOptionValue(options.numLabeledSentences);
}
@@ -602,7 +607,7 @@ private void getHelperTransitions(String[] paths) {
private String[] getNames() {
System.out.println("Reading names file...");
- String namesFile = "../lib/names";
+ String namesFile = "lib/names";
BufferedReader bReader =
BasicFileIO.openFileToRead(namesFile);
String line = BasicFileIO.getLine(bReader);
@@ -621,7 +626,7 @@ private void getHelperTransitions(String[] paths) {
private Map<String, double[]> readDistSim() {
System.out.println("Reading embeddings file...");
- String distSimFile = "../lib/embeddings.txt";
+ String distSimFile = "lib/embeddings.txt";
BufferedReader bReader =
BasicFileIO.openFileToRead(distSimFile);
String line = BasicFileIO.getLine(bReader);
@@ -731,7 +736,7 @@ private void logInputInfo() {
}
- private void initializeDataStructures() {
+ public void initializeDataStructures() {
featureIndexCounts = new ArrayList<Integer>();
indexToPOS = new ArrayList<String>();
indexToWord = new ArrayList<String>();
@@ -1791,9 +1796,9 @@ public void test() {
testFeatureHMM(sequences);
}
}
-
- public void testCRF(Collection<Pair<List<String>, List<String>>> sequences) {
- POSModel model = (POSModel) BasicFileIO.readSerializedObject(modelFile);
+
+ public List<List<String>> testCRF(Collection<Pair<List<String>, List<String>>> sequences,
+ POSModel model) {
featureIndexCounts = model.getFeatureIndexCounts();
featureToIndex = model.getFeatureToIndex();
indexToFeature = model.getIndexToFeature();
@@ -1809,8 +1814,8 @@ public void testCRF(Collection<Pair<List<String>, List<String>>> sequences) {
posToIndex);
lObservations = pairList.getFirst();
goldLabels = pairList.getSecond();
- logObservationInfo();
- logInputInfo();
+ // logObservationInfo();
+ // logInputInfo();
if (useStackedFeatures) {
Collection<Pair<List<String>, List<String>>> stackedSequences
= TabSeparatedFileReader.readPOSSeqences(stackedFile,
@@ -1870,25 +1875,45 @@ public void testCRF(Collection<Pair<List<String>, List<String>>> sequences) {
Inference inf = new Inference(numLabels, vertexExtractor, edgeExtractor);
double total = 0.0;
double correct = 0.0;
-
- BufferedWriter bWriter = BasicFileIO.openFileToWrite(runOutput);
- for (int i = 0; i < lObservations.length; i ++) {
- List<Integer> tags = posteriorDecode(
- lObservations[i], inf, largerSetOfWeights);
- // List<Integer> tags = getViterbiLabelSequence(lObservations[i], inf, largerSetOfWeights);
- for (int j = 0; j < goldLabels[i].length; j++) {
- if(goldLabels[i][j] == tags.get(j)) {
- correct++;
+ if (runOutput != null) {
+ BufferedWriter bWriter = BasicFileIO.openFileToWrite(runOutput);
+ for (int i = 0; i < lObservations.length; i ++) {
+ List<Integer> tags = posteriorDecode(
+ lObservations[i], inf, largerSetOfWeights);
+ // List<Integer> tags = getViterbiLabelSequence(lObservations[i], inf, largerSetOfWeights);
+ for (int j = 0; j < goldLabels[i].length; j++) {
+ if(goldLabels[i][j] == tags.get(j)) {
+ correct++;
+ }
+ total++;
+ BasicFileIO.writeLine(bWriter,
+ indexToWord.get(lObservations[i][j]) +
+ "\t" + indexToPOS.get(tags.get(j)));
}
- total++;
- BasicFileIO.writeLine(bWriter,
- indexToWord.get(lObservations[i][j]) +
- "\t" + indexToPOS.get(tags.get(j)));
+ BasicFileIO.writeLine(bWriter, "");
}
- BasicFileIO.writeLine(bWriter, "");
+ log.info("Accuracy:" + (correct / total));
+ BasicFileIO.closeFileAlreadyWritten(bWriter);
+ } else {
+ ArrayList<List<String>> col =
+ new ArrayList<List<String>>();
+ for (int i = 0; i < lObservations.length; i ++) {
+ List<Integer> tags = posteriorDecode(
+ lObservations[i], inf, largerSetOfWeights);
+ List<String> list = new ArrayList<String>();
+ for (int j = 0; j < goldLabels[i].length; j++) {
+ list.add(indexToPOS.get(tags.get(j)));
+ }
+ col.add(list);
+ }
+ return col;
}
- log.info("Accuracy:" + (correct / total));
- BasicFileIO.closeFileAlreadyWritten(bWriter);
+ return null;
+ }
+
+ public void testCRF(Collection<Pair<List<String>, List<String>>> sequences) {
+ POSModel model = (POSModel) BasicFileIO.readSerializedObject(modelFile);
+ testCRF(sequences, model);
}
public List<Integer> posteriorDecode(int[] s, Inference inf, double[] w) {
2  src/edu/cmu/cs/lti/ark/ssl/pos/TagDictionary.java
@@ -19,7 +19,7 @@ public TagDictionary() {
public static TagDictionary instance() {
if (_instance == null) {
_instance = new TagDictionary();
- _instance.loadData("../lib/tagdict.txt");
+ _instance.loadData("lib/tagdict.txt");
}
return _instance;
}
5 src/edu/cmu/cs/lti/ark/ssl/util/ProduceInterpolatedMultinomials.java
@@ -111,7 +111,10 @@ public static void main(String[] args) {
}
}
- String[] sortedArray = Arrays.copyOf(validTagArray, validTagArray.length);
+ String[] sortedArray = new String[validTagArray.length];
+ for (int s = 0; s < validTagArray.length; s++) {
+ sortedArray[s] = new String(validTagArray[s]);
+ }
Arrays.sort(sortedArray);
System.out.println("Sorted array:");
for (String str: sortedArray) {
5 src/edu/cmu/cs/lti/ark/ssl/util/ProduceInterpolatedMultinomialsTagDictionary.java
@@ -107,7 +107,10 @@ public static void main(String[] args) {
}
}
- String[] sortedArray = Arrays.copyOf(validTagArray, validTagArray.length);
+ String[] sortedArray = new String[validTagArray.length];
+ for (int s = 0; s < validTagArray.length; s++) {
+ sortedArray[s] = new String(validTagArray[s]);
+ }
Arrays.sort(sortedArray);
System.out.println("Sorted array:");
for (String str: sortedArray) {
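
Reviewer note: both hunks above replace `Arrays.copyOf` (an API added in Java 6) with a loop that builds a new `String` object per element. The commit does not state the reason. If the only goal is a separate array that can be sorted without reordering `validTagArray`, a shallow copy is enough, since `String` objects are immutable. The sketch below illustrates that alternative; the class name `ArrayCopySketch`, the method `sortedCopy`, and the sample tags are hypothetical and not part of this commit.

```java
import java.util.Arrays;

class ArrayCopySketch {
    /** Returns a sorted copy of the input and leaves the input order untouched. */
    static String[] sortedCopy(String[] input) {
        // clone() copies the array itself; the String elements are shared,
        // which is safe because Strings cannot be modified.
        String[] copy = input.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] validTagArray = {"V", "N", "A"}; // illustrative tags only
        String[] sortedArray = sortedCopy(validTagArray);
        System.out.println("Sorted array:");
        for (String str : sortedArray) {
            System.out.println(str);
        }
        System.out.println("Original first element: " + validTagArray[0]);
    }
}
```

`clone()` on arrays predates Java 6, so this form would also compile on older JDKs, assuming that was the motivation.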
49 src/edu/cmu/cs/lti/ark/tweetnlp/RunPOSTagger.java
@@ -29,42 +29,30 @@
public static String output = null;
@Option(gloss="conll = one token per line, blank lines separating tweets.")
- public static String output_format = "conll";
+ public static String format = "conll";
}
+ private static TweetTaggerInstance ttInstance = null;
+
/** Returns list of tags, one per token, parallel to the input tokens. */
- public static List<String> doPOSTagging(List<String> toks, SemiSupervisedPOSTagger tagger) {
- // TODO please replace this
- return dummyTagging(toks);
+ public static List<String> doPOSTagging(List<String> toks) {
+ return tweetTagging(toks);
}
- public static List<String> dummyTagging(List<String> toks) {
- ArrayList<String> tags = new ArrayList();
- for (String tok : toks) tags.add("N");
- return tags;
+
+ public static List<String> tweetTagging(List<String> toks) {
+ return TweetTaggerInstance.getInstance().getTagsForOneSentence(toks);
}
+
public static void main(String[] args) throws Exception {
OptionsParser op = new OptionsParser(Opts.class);
op.doParse(args);
-// System.out.println("OPTIONS:\n" + op.doGetOptionPairs());
+ if (Opts.input == null) {
+ op.printHelp();
+ return;
+ }
if (Opts.input.equals("-")) throw new RuntimeException("stdin unimplemented");
-
-
- System.out.println("Loading POS tagger.");
- // TODO dipanjan please help :)
- // If you keep it, be aware in its current state it doesn't seem to work, it needs more tweaking.
- // Right now it doesn't work anyways.
- // One big thing is, --testSet is not a good way to go.
- // This code needs to be in control of feeding in sentences.
- // This string-arg list is a horrible hack, feel free to get rid of.
- String[] posOptionArgs = new String[]{
- "--trainOrTest","test"
- // ... more ...
- };
- POSOptions posOptions = new POSOptions(posOptionArgs);
-// SemiSupervisedPOSTagger tagger = new SemiSupervisedPOSTagger(posOptions);
- SemiSupervisedPOSTagger tagger = null;
-
+
System.out.println("Tagging tweets from file: " + Opts.input);
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(Opts.input), "UTF-8"));
@@ -72,15 +60,16 @@ public static void main(String[] args) throws Exception {
String line;
while((line = reader.readLine()) != null) {
List<String> toks = Twokenize.tokenizeForTagger_J(line);
- List<String> tags = doPOSTagging(toks, tagger);
-
- if (Opts.output_format.equals("conll")) {
+ List<String> tags = doPOSTagging(toks);
+ if (Opts.format.equals("conll")) {
for (int i=0; i < toks.size(); i++) {
writer.write(toks.get(i) + "\t" + tags.get(i) + "\n");
}
writer.write("\n");
writer.flush();
- }
+ } else {
+ throw new RuntimeException("Unknown output format " + Opts.format);
+ }
}
}
}
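
Reviewer note: for readers unfamiliar with the "conll" layout named in the option gloss, the sketch below reproduces what the writing loop above emits: one token and its tag per line, separated by a tab, with a blank line after each tweet. The class name `ConllFormatSketch`, the method `formatConll`, and the sample tokens and tags are hypothetical and not part of this commit.

```java
import java.util.Arrays;
import java.util.List;

class ConllFormatSketch {
    /** Formats one tweet as token<TAB>tag lines followed by a blank line. */
    static String formatConll(List<String> toks, List<String> tags) {
        if (toks.size() != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must be parallel lists");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < toks.size(); i++) {
            sb.append(toks.get(i)).append('\t').append(tags.get(i)).append('\n');
        }
        sb.append('\n'); // blank line separates tweets
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> toks = Arrays.asList("so", "tired", ":(");
        List<String> tags = Arrays.asList("R", "A", "E"); // illustrative tags only
        System.out.print(formatConll(toks, tags));
    }
}
```

The explicit size check is an addition of this sketch; the loop in the diff assumes the tagger returns exactly one tag per token.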
87 src/edu/cmu/cs/lti/ark/tweetnlp/TweetTaggerInstance.java
@@ -0,0 +1,87 @@
+package edu.cmu.cs.lti.ark.tweetnlp;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.logging.Level;
+
+import edu.cmu.cs.lti.ark.ssl.pos.POSFeatureTemplates;
+import edu.cmu.cs.lti.ark.ssl.pos.POSModel;
+import edu.cmu.cs.lti.ark.ssl.pos.POSOptions;
+import edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger;
+import edu.cmu.cs.lti.ark.ssl.util.BasicFileIO;
+import fig.basic.Pair;
+
+/** Wraps SemiSupervisedPOSTagger for easier inference-only usage (i.e. to tag new sentences) */
+public class TweetTaggerInstance {
+ private SemiSupervisedPOSTagger tagger = null;
+ private POSModel model = null;
+
+ private static TweetTaggerInstance ttInstance;
+
+ public static TweetTaggerInstance getInstance() {
+ if (ttInstance == null) {
+ ttInstance = new TweetTaggerInstance();
+ }
+ return ttInstance;
+ }
+
+ private TweetTaggerInstance() {
+ List<String> argList = new ArrayList<String>();
+ argList.add("--trainOrTest");
+ argList.add("test");
+ argList.add("--useGlobalForLabeledData");
+ argList.add("--useStandardMultinomialMStep");
+ argList.add("--useStandardFeatures");
+ argList.add("--regularizationWeight");
+ argList.add("0.707");
+ argList.add("--regularizationBias");
+ argList.add("0.0");
+ argList.add("--initialWeightsLower");
+ argList.add("-0.01");
+ argList.add("--initialWeightsUpper");
+ argList.add("0.01");
+ argList.add("--iters");
+ argList.add("1000");
+ argList.add("--printRate");
+ argList.add("100");
+ argList.add("--execPoolDir");
+ argList.add("/tmp");
+ argList.add("--modelFile");
+ argList.add("lib/tweetpos.model");
+ argList.add("--useDistSim");
+ argList.add("--useNames");
+ argList.add("--numLabeledSentences");
+ argList.add("100000");
+ argList.add("--maxSentenceLength");
+ argList.add("200");
+ String[] args = new String[argList.size()];
+ argList.toArray(args);
+ POSOptions options = new POSOptions(args);
+ options.parseArgs(args);
+ tagger = new SemiSupervisedPOSTagger(options);
+ model = (POSModel) BasicFileIO.readSerializedObject("lib/tweetpos.model");
+ tagger.initializeDataStructures();
+
+ POSFeatureTemplates.log.setLevel(Level.WARNING);
+ SemiSupervisedPOSTagger.log.setLevel(Level.WARNING);
+ }
+
+ public List<String> getTagsForOneSentence(List<String> words) {
+ // BTO: i don't get this, does tagger.testCRF need a dummy list or something? can we delete?
+ ArrayList<String> dTags = new ArrayList<String>();
+ for (String tok : words) {
+ dTags.add("N");
+ }
+
+ List<Pair<List<String>, List<String>>> col =
+ new ArrayList<Pair<List<String>, List<String>>>();
+ col.add(new Pair<List<String>, List<String>>(words, dTags));
+ List<List<String>> col1 = tagger.testCRF(col, model);
+ if (col1.size() != 1) {
+ throw new RuntimeException("Problem with the returned size of the collection. Should be 1.");
+ }
+ List<String> tags = col1.get(0);
+ return tags;
+ }
+}
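
Reviewer note: `getInstance()` above creates the singleton lazily without synchronization, so two threads calling it at the same time could each load the model. The commit does not say whether multi-threaded use is intended. If it is, one common option is the initialization-on-demand holder idiom, sketched below with a stand-in class. The names `LazyHolderSketch`, `Holder`, and the `constructions` counter are hypothetical and exist only for illustration.

```java
class LazyHolderSketch {
    static int constructions = 0; // counts constructor runs, for demonstration only

    private LazyHolderSketch() {
        constructions++; // stands in for expensive model loading
    }

    // The JVM initializes Holder at most once, on first access,
    // and guards that initialization with its own lock.
    private static class Holder {
        static final LazyHolderSketch INSTANCE = new LazyHolderSketch();
    }

    static LazyHolderSketch getInstance() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    getInstance();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("Constructor ran " + constructions + " time(s)");
    }
}
```

The instance is still created lazily, on the first call to `getInstance()`, so the loading cost is paid only when tagging is actually requested.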
24 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/CMUPOSAttribute.java
@@ -2,18 +2,22 @@
import org.apache.lucene.util.Attribute;
-public class CMUPOSAttribute implements Attribute {
+public interface CMUPOSAttribute extends Attribute {
- /** One-character tagname -- the official format as seen in the annotated training data.
- * We could move to an enum but that would be more work to maintain.
+
+ /**
+ * One-character tagname -- the official format as seen in the annotated training data.
+ * We should move to an enum but that would be more work to maintain.
+ */
+ String getTag();
+
+ /**
+ * The token (just a string) which this tag tags
*/
- public String tag;
+ String getToken();
- /** The token (just a string) which this tag tags */
- public String token;
+ void setToken(String token);
+ void setTag(String tag);
- public CMUPOSAttribute(String token, String tag) {
- this.token = token;
- this.tag = tag;
- }
+ Object clone();
}
70 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/CMUPOSAttributeImpl.java
@@ -0,0 +1,70 @@
+package edu.cmu.cs.lti.ark.tweetnlp.twitter_anno;
+
+import org.apache.lucene.util.AttributeImpl;
+
+/** warning, the boilerplate-like inherited methods haven't been tested much -BTO */
+public class CMUPOSAttributeImpl extends AttributeImpl implements CMUPOSAttribute {
+ private String tag;
+
+ private String token;
+
+ public CMUPOSAttributeImpl() {
+ System.out.println("construct");
+ }
+
+ public CMUPOSAttributeImpl(String token, String tag) {
+ this.setToken(token);
+ this.setTag(tag);
+ }
+
+
+ public String getTag() {
+ return tag;
+ }
+
+ public void setTag(String tag) {
+ this.tag = tag;
+ }
+
+ public String getToken() {
+ return token;
+ }
+
+ public void setToken(String token) {
+ this.token = token;
+ }
+
+ @Override
+ public void clear() {
+ this.token = null;
+ this.tag = null;
+ }
+
+ @Override
+ public void copyTo(AttributeImpl target) {
+ if (target instanceof CMUPOSAttributeImpl) {
+ ((CMUPOSAttributeImpl) target).setTag(getTag());
+ ((CMUPOSAttributeImpl) target).setToken(getToken());
+ }
+ }
+
+ @Override
+ public boolean equals(Object other) {
+ return other != null
+ && other instanceof CMUPOSAttributeImpl
+ && (tag == null ? ((CMUPOSAttributeImpl) other).tag == null : tag.equals(((CMUPOSAttributeImpl) other).tag)); // compare by value, not by reference
+ }
+
+ @Override
+ public int hashCode() {
+ return tag == null ? 0 : tag.hashCode(); // tag is null after clear()
+ }
+
+ public Object clone() {
+ CMUPOSAttributeImpl result = (CMUPOSAttributeImpl) super.clone();
+ return result;
+ }
+
+
+
+}
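Value-based, null-safe comparison matters for equals and hashCode on a class like this one, because clear() sets both fields to null and two equal tag strings need not be the same object. The following is a minimal sketch of that idea, using a hypothetical TagHolder class rather than the real CMUPOSAttributeImpl. It sticks to Java 6 era syntax, in keeping with the libraries in this commit.

```java
public class TagEqualitySketch {
    /** Hypothetical holder with a single nullable tag field. */
    public static final class TagHolder {
        private String tag;

        public TagHolder(String tag) { this.tag = tag; }

        /** Mirrors an attribute's clear(): the tag becomes null. */
        public void clear() { this.tag = null; }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof TagHolder)) return false; // also rejects null
            String otherTag = ((TagHolder) other).tag;
            // Compare contents, treating two null tags as equal.
            return tag == null ? otherTag == null : tag.equals(otherTag);
        }

        @Override
        public int hashCode() {
            // Must not throw when the tag has been cleared.
            return tag == null ? 0 : tag.hashCode();
        }
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects, so == on the tags would be false.
        TagHolder a = new TagHolder(new String("N"));
        TagHolder b = new TagHolder(new String("N"));
        System.out.println(a.equals(b)); // true
        a.clear();
        System.out.println(a.hashCode()); // 0
    }
}
```

Only the tag is compared here, so equals and hashCode stay consistent with each other; including the token as well would be an equally reasonable choice.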
20 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/CMUPOSTagger.java
@@ -1,6 +1,7 @@
package edu.cmu.cs.lti.ark.tweetnlp.twitter_anno;
import com.twitter.common.text.token.TokenStream;
+import edu.cmu.cs.lti.ark.tweetnlp.TweetTaggerInstance;
import edu.cmu.cs.lti.ark.tweetnlp.Twokenize;
import org.apache.lucene.util.Attribute;
@@ -23,7 +24,7 @@
private int tokenIndex = -1;
public CMUPOSTagger() {
- addAttribute(CMUPOSAttribute.class); // WTF does this do?
+ this.posAttr = addAttribute(CMUPOSAttribute.class); // registers the attribute on this stream and returns the shared instance that incrementToken() updates
}
@Override
@@ -33,7 +34,10 @@ public boolean incrementToken() {
return false;
}
- posAttr = new CMUPOSAttribute(tweetTokens.get(tokenIndex), tweetTags.get(tokenIndex));
+ posAttr.setTag(tweetTags.get(tokenIndex));
+ posAttr.setToken(tweetTokens.get(tokenIndex));
+
+ tokenIndex++;
return true;
}
@@ -41,10 +45,16 @@ public boolean incrementToken() {
@Override
public void reset(CharSequence input) {
this.tweetTokens = Twokenize.tokenizeForTagger_J(input.toString());
- this.tweetTags = dummyTagging(tweetTokens);
+ this.tweetTags = doTagging(tweetTokens);
+ this.tokenIndex = 0;
+ }
+
+ private List<String> doTagging(List<String> toks) {
+ return TweetTaggerInstance.getInstance().getTagsForOneSentence(toks);
+// return dummyTagging(toks);
}
- public static List<String> dummyTagging(List<String> toks) {
+ private static List<String> dummyTagging(List<String> toks) {
ArrayList<String> tags = new ArrayList<String>();
for (String tok : toks) tags.add("N");
return tags;
@@ -56,7 +66,7 @@ public static void main(String[] args) {
stream.reset("This is what I want to tag.");
while (stream.incrementToken()) {
CMUPOSAttribute posAttribute = stream.getAttribute(CMUPOSAttribute.class);
- System.out.printf("token= %s | POS= %s\n", posAttribute.token, posAttribute.tag);
+ System.out.printf("token= %s \t| POS= %s\n", posAttribute.getToken(), posAttribute.getTag());
}
}
}
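The change in incrementToken() moves from allocating a new attribute per token to refilling one shared attribute instance, which callers read after each successful call. A minimal, Lucene-free sketch of that pattern follows. MiniTagStream, PosHolder and drain are hypothetical names, and the tags in main are placeholder values, not tagger output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReusedAttributeSketch {
    /** Hypothetical mutable holder, playing the role of the POS attribute. */
    public static final class PosHolder {
        public String token;
        public String tag;
    }

    /** Hypothetical stream: one shared holder, refilled on every incrementToken(). */
    public static final class MiniTagStream {
        private final PosHolder holder = new PosHolder();
        private List<String> tokens = new ArrayList<String>();
        private List<String> tags = new ArrayList<String>();
        private int index = 0;

        /** Loads a new input and rewinds to the first token. */
        public void reset(List<String> tokens, List<String> tags) {
            this.tokens = tokens;
            this.tags = tags;
            this.index = 0;
        }

        /** Advances by one token; returns false once the input is exhausted. */
        public boolean incrementToken() {
            if (index >= tokens.size()) return false;
            holder.token = tokens.get(index);
            holder.tag = tags.get(index);
            index++;
            return true;
        }

        public PosHolder getHolder() { return holder; }
    }

    /** Drains the stream, copying values out of the shared holder before it is overwritten. */
    public static List<String> drain(MiniTagStream stream) {
        List<String> out = new ArrayList<String>();
        while (stream.incrementToken()) {
            PosHolder h = stream.getHolder();
            out.add(h.token + "/" + h.tag);
        }
        return out;
    }

    public static void main(String[] args) {
        MiniTagStream stream = new MiniTagStream();
        stream.reset(Arrays.asList("I", "tag", "tweets"), Arrays.asList("O", "V", "N"));
        System.out.println(drain(stream)); // [I/O, tag/V, tweets/N]
    }
}
```

Because the holder is reused, a caller that wants to keep a token or tag past the next incrementToken() call has to copy the strings out, as drain does.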
7 src/edu/cmu/cs/lti/ark/tweetnlp/twitter_anno/MyTokenizerUsageExample.java
@@ -34,7 +34,7 @@ public static void main(String[] args) {
// TokenStream stream = tokenizer.getDefaultTokenStream();
// BTO: above turns out to be a TokenizedCharSequenceStream
- TokenStream stream = new CMUPOSTagger();
+ TokenStream stream = new CMUPOSTagger();
// We're going to ask the token stream what type of attributes it makes available. "Attributes"
// can be understood as "annotations" on the original text.
@@ -56,13 +56,18 @@ public static void main(String[] args) {
// Now we're going to consume tokens from the stream.
int tokenCnt = 0;
while (stream.incrementToken()) {
+ // TODO: none of the attribute lookups below work yet
+
+
// CharSequenceTermAttribute holds the actual token text. This is preferred over
// TermAttribute because it avoids creating new String objects.
+
CharSequenceTermAttribute termAttribute = stream
.getAttribute(CharSequenceTermAttribute.class);
// OffsetAttribute holds indexes into the original String that the current token occupies.
// The startOffset character position is inclusive; the endOffset is exclusive.
+
OffsetAttribute offsetAttribute = stream.getAttribute(OffsetAttribute.class);
// TokenTypeAttribute holds, as you'd expect, the type of the token.
6 src/edu/cmu/cs/lti/ark/tweetnlp/twokenize.scala
@@ -53,6 +53,12 @@ import scala.collection.JavaConversions._
June 2011
*/
+/**
+ * TODO
+ * - byte offsets should be added here. can easily re-align
+ * since the only munged characters are whitespace (hopefully)
+ */
+
import scala.util.matching.Regex
import collection.JavaConversions._