kkrugler
(Ken Krugler)
- Name: Ken Krugler
- Website/Blog: http://ken-blog.krugler.org
- Company: TransPac Software, Inc.
- Member Since: Mar 23, 2009 (8 months)
- 1 public repo
- 0 followers
Following 0 githubbers and watching 4 repositories
Public Repositories (1)
Public Activity 
HEAD is 1d0f0304df0018f85b9fd6160865a035d6afff9c
-
Ken Krugler
committed
1d0f0304:
Add test for valid fetched datum
-
Ken Krugler
committed
09eb76c6:
Improved error reporting while parsing robots.txt
-
Ken Krugler
committed
03fbde6f:
Rolled in Vivek's mods to support language detection:
-
Ken Krugler
committed
dee31888:
Moved ParseFunction in as a sub-class.
HEAD is fe355225f615d0a03e90c4c0da675665156924c3
-
Ken Krugler
committed
fe355225:
New utility routines.
HEAD is 597f0d87399b93a2decbf53913d6d50099ba6cb9
-
Ken Krugler
committed
597f0d87:
Cleanup
-
Ken Krugler
committed
a740e356:
Clarified when skipped-by-scorer could be used
-
Ken Krugler
committed
70584ee4:
Make sure last status is set to current time versus 0.
-
Ken Krugler
committed
b4404bfb:
Rename of pipe for clarity.
-
Ken Krugler
committed
1c4e937f:
Fixed bug w/use of explicit map type for metadata in BaseDatum.
-
Ken Krugler
committed
e8692429:
Added utility Splitter sub-assembly.
-
Ken Krugler
committed
9f722f24:
Support optional crawler version number.
HEAD is 386a09d895678015621d59bdefc5f64e17ebb412
-
Ken Krugler
committed
386a09d8:
First cut at simple URL loader that handles text files.
-
Ken Krugler
committed
4cfbf80f:
Cleaned up constants
-
Ken Krugler
committed
4cc9104d:
Added handy metadata constants.
-
Ken Krugler
committed
c064b4b4:
Rolled in Vivek's diff that removes a bunch of warnings.
-
Ken Krugler
committed
040f2562:
Added findAllSubdirs(), which can be used to build a combo set of pipes for processing
HEAD is fc770135e3d01e7982ee25da5c79028f1f5ce19f
-
Ken Krugler
committed
fc770135:
BIXO-31: Switch from LOG to LOGGER.
HEAD is 8efb04b0c9fa5f75cdf6b68ad3f369521b96a359
-
Ken Krugler
committed
8efb04b0:
Merge branch 'master' into release
-
Ken Krugler
committed
871521a2:
Try to re-add files with correct names
-
Ken Krugler
committed
b1e6cffd:
Arghhh - still trying to clean up rename mess
-
Ken Krugler
committed
e1a2188d:
Move into main (got added to test by accident)
-
Ken Krugler
committed
b52e4ad3:
Change msg used during tagging of release.
-
Ken Krugler
committed
f36b73c6:
Fixed some typos
-
Ken Krugler
committed
1de8362e:
Re-add files with the correct names
-
Ken Krugler
committed
f8fabb47:
Try to clean up rename mess by getting rid of files with multiple cases.
New tag is at emi/bixo/tree/0.4.7
HEAD is 42e6be876bdb6d9995f2ef95380bbb8c41d0b9fc
-
Ken Krugler
committed
42e6be87:
Turned on case-sensitivity so that git "knows" about name changes
HEAD is 30cdde5bf04d629c9305cbb67781eabcc3886f0c
-
Ken Krugler
committed
03285954:
Update to use new StringUtils support.
-
Ken Krugler
committed
43da7ff6:
Added some documentation on the helpful contrib,
-
Ken Krugler
committed
1989ca9d:
Added class for configuring Hadoop jobs (and some
-
Ken Krugler
committed
73a65386:
Added Nutch's StringUtils, and pulled in our string
-
Ken Krugler
committed
f393e00e:
Log URL of item being parsed, at trace level.
-
Ken Krugler
committed
a6eb9fef:
When we get a fetch error, log at trace level.
-
Ken Krugler
committed
ac753118:
Output extra info when a commonly failing integration test
-
Ken Krugler
committed
af0010b4:
Clean up name of FetchPipe content output tail pipe.
-
Ken Krugler
committed
2f833789:
Bump log level to trace during testing
-
Ken Krugler
committed
41cfad05:
Update pom for helpful contrib to use 1.0-SNAPSHOT of bixo,
-
Ken Krugler
committed
051ac7af:
Added bixo.root.level property, so it's easier to override this
-
Ken Krugler
committed
d28ca2fc:
Output trace msg when generating a fetch list.
-
Ken Krugler
committed
685cd538:
Output number of remaining URLs in status msg.
-
Ken Krugler
committed
d9bda6c5:
Trim the href URL we get back from the Tika parser, since it
-
Ken Krugler
committed
016868e1:
Don't bother trying to resolve relative "about:" links.
-
Ken Krugler
committed
a8230c56:
Added test and fixed bug where domain name ended
-
Ken Krugler
committed
9e1b1045:
Added serialization test.
-
Ken Krugler
committed
c8ae7813:
Fixed bug where exception data (especially the msg) wasn't
-
Ken Krugler
committed
4e26d277:
Fixed bug that showed up when last header had key but no value
-
Ken Krugler
committed
6f259143:
Update to release procedure.
And 25 more commits...
New tag is at emi/bixo/tree/0.4.6
HEAD is 55655d0258036a7f9dc030f49d25e6b690af3cf4
-
Ken Krugler
committed
55655d02:
The integration test build & execution classpath needs the
-
Ken Krugler
committed
b4e46124:
More release procedure improvements.
-
Ken Krugler
committed
67fb6454:
Make names of util package classes consistent.
-
Ken Krugler
committed
d1df2a1f:
Minor tweak to order of steps for releasing.
-
Ken Krugler
committed
184a8ee2:
First cut at command line tool that can be used to do
-
Ken Krugler
committed
c875d12f:
Use -crawldir vs. -outputdir to indicate directory that contains
-
Ken Krugler
committed
a3fcac3d:
Added UserAgent class, which is now used in place of full user-agent string.
-
Ken Krugler
committed
b80f1487:
Changed SiteCrawlerTest to be a long-running test, and improved
-
Ken Krugler
committed
01153012:
The HttpClient 3.1 code has to be there when running SimpleCrawlTool,
-
Ken Krugler
committed
56d6848e:
Include all dependent libraries when doing distribution build.
-
Ken Krugler
committed
59a12ec6:
Get rid of unused test code.
HEAD is 128d9c2e6f9a268dad21b9cbc9e8371792605e7d
-
Ken Krugler
committed
128d9c2e:
Added utility class for creating/manipulating "loop" crawl
-
Ken Krugler
committed
6e875284:
Try to mask a bogus Hadoop map-reduce warning.
-
Ken Krugler
committed
2d0e0854:
Fixed bug where check for JavaScript: and such was
-
Ken Krugler
committed
167d013d:
Removed addition of "www." to raw (paid level) domains,
-
Ken Krugler
committed
630d715d:
Reworked SimpleCrawlTool to be a site crawler that can
-
Ken Krugler
committed
50f7424c:
Minor naming cleanup
-
Ken Krugler
committed
62a64e59:
Add explicit check for sitemap: directive, which
-
Ken Krugler
committed
42a5c649:
Fixed a bug where we were setting the wrong UrlStatus when a
-
Ken Krugler
committed
597dca31:
Add utility for injecting new meta-data value.
HEAD is df30703fbfed5fd8841ac7d6ad305a0b96cb4f37
-
Ken Krugler
committed
03285954:
Update to use new StringUtils support.
-
Ken Krugler
committed
43da7ff6:
Added some documentation on the helpful contrib,
-
Ken Krugler
committed
1989ca9d:
Added class for configuring Hadoop jobs (and some
-
Ken Krugler
committed
73a65386:
Added Nutch's StringUtils, and pulled in our string
-
Ken Krugler
committed
f393e00e:
Log URL of item being parsed, at trace level.
-
Ken Krugler
committed
a6eb9fef:
When we get a fetch error, log at trace level.
-
Ken Krugler
committed
ac753118:
Output extra info when a commonly failing integration test
-
Ken Krugler
committed
af0010b4:
Clean up name of FetchPipe content output tail pipe.
-
Ken Krugler
committed
2f833789:
Bump log level to trace during testing
-
Ken Krugler
committed
41cfad05:
Update pom for helpful contrib to use 1.0-SNAPSHOT of bixo,
-
Ken Krugler
committed
051ac7af:
Added bixo.root.level property, so it's easier to override this
-
Ken Krugler
committed
d28ca2fc:
Output trace msg when generating a fetch list.
-
Ken Krugler
committed
685cd538:
Output number of remaining URLs in status msg.
-
Ken Krugler
committed
d9bda6c5:
Trim the href URL we get back from the Tika parser, since it
-
Ken Krugler
committed
016868e1:
Don't bother trying to resolve relative "about:" links.
-
Ken Krugler
committed
a8230c56:
Added test and fixed bug where domain name ended
-
Ken Krugler
committed
9e1b1045:
Added serialization test.
-
Ken Krugler
committed
c8ae7813:
Fixed bug where exception data (especially the msg) wasn't
-
Ken Krugler
committed
4e26d277:
Fixed bug that showed up when last header had key but no value
-
Ken Krugler
committed
6f259143:
Update to release procedure.
And 3 more commits...
HEAD is 309151feabb118c183c434d48386ddff01a036e7
-
Ken Krugler
committed
309151fe:
Committing 0.4.5 release build
-
Ken Krugler
committed
1dcc4f2b:
Merge branch 'master' into release
-
Ken Krugler
committed
13fd0af5:
Fixed bug where connection wasn't being aborted when we got back
-
Ken Krugler
committed
c0ef605f:
If the score was set to the special "skip me" value, then
-
Ken Krugler
committed
23db0900:
If we get an exception while loading robots.txt, don't kill the
-
Ken Krugler
committed
ec6d9d38:
Add support for getting the tail pipe name.
-
Ken Krugler
committed
1967fe3a:
Added test for robots.txt redirecting to https, and getting an error fetching that.
-
Ken Krugler
committed
6ecbbe1e:
Update status whenever number of URLs/domains being fetched changes.
-
Ken Krugler
committed
a1eba20a:
Output HTTP status code & headers w/getMessage() call.
-
Ken Krugler
committed
5042a31e:
Added SKIPPED_FILTERED in preparation for better filtering support.
-
Ken Krugler
committed
c301eee8:
IScoreGenerator.generateScore no longer throws an IOException
-
Ken Krugler
committed
833d9a7d:
Improve output when we get a fetch exception.
-
Ken Krugler
committed
036fead6:
Add status time to datum.
-
Ken Krugler
committed
c1851901:
Fixed Javadoc comment
-
Ken Krugler
committed
9e24a64b:
Pulled out the RedirectResponseHandler and made it more general for use
-
Ken Krugler
committed
8b5e8750:
Added explicit UNSET_CRAWL_DELAY to differentiate from the default crawl delay,
-
Ken Krugler
committed
6dd68893:
Added trust manager (from Nutch) needed for no-certificate SSL (https) connections.
-
Ken Krugler
committed
804517ff:
Fixed up comments.
-
Ken Krugler
committed
35e02bcd:
Minor update to release procedure
New tag is at emi/bixo/tree/0.4.5
HEAD is 13fd0af515141a8b08ac1768bc97c8c4c19839ad
-
Ken Krugler
committed
13fd0af5:
Fixed bug where connection wasn't being aborted when we got back
-
Ken Krugler
committed
c0ef605f:
If the score was set to the special "skip me" value, then
-
Ken Krugler
committed
23db0900:
If we get an exception while loading robots.txt, don't kill the
-
Ken Krugler
committed
ec6d9d38:
Add support for getting the tail pipe name.
-
Ken Krugler
committed
1967fe3a:
Added test for robots.txt redirecting to https, and getting an error fetching that.
-
Ken Krugler
committed
6ecbbe1e:
Update status whenever number of URLs/domains being fetched changes.
-
Ken Krugler
committed
a1eba20a:
Output HTTP status code & headers w/getMessage() call.
-
Ken Krugler
committed
5042a31e:
Added SKIPPED_FILTERED in preparation for better filtering support.
-
Ken Krugler
committed
c301eee8:
IScoreGenerator.generateScore no longer throws an IOException
-
Ken Krugler
committed
833d9a7d:
Improve output when we get a fetch exception.
-
Ken Krugler
committed
036fead6:
Add status time to datum.
-
Ken Krugler
committed
c1851901:
Fixed Javadoc comment
-
Ken Krugler
committed
9e24a64b:
Pulled out the RedirectResponseHandler and made it more general for use
-
Ken Krugler
committed
8b5e8750:
Added explicit UNSET_CRAWL_DELAY to differentiate from the default crawl delay,
-
Ken Krugler
committed
6dd68893:
Added trust manager (from Nutch) needed for no-certificate SSL (https) connections.
-
Ken Krugler
committed
804517ff:
Fixed up comments.
-
Ken Krugler
committed
35e02bcd:
Minor update to release procedure
HEAD is 50db513f275efdbc9937a5642e1e03e8bce2ae67
-
Ken Krugler
committed
b28aac06:
Moved test into integration area, as it depends on external (DNS) resources.
-
Ken Krugler
committed
60450e5b:
Move tool out of test.
-
Ken Krugler
committed
a0562398:
Support 1..n URLs in command line.
-
Ken Krugler
committed
5338f0a5:
Fixed handler used for redirection test.
-
Ken Krugler
committed
c05d6b9a:
Fixed generation of redirected URL
-
Ken Krugler
committed
08324a3c:
Improve output of toString()
-
Ken Krugler
committed
36b9ab42:
Get rid of release artifacts from master - these
-
Ken Krugler
committed
aaebdfb4:
Print extra info when we get a failure.
-
Ken Krugler
committed
8b288cf0:
Cleaned up vestigial use of "testcase".
-
Ken Krugler
committed
cb6e759e:
Comment out failing test for now.
-
Ken Krugler
committed
4a11b674:
Doesn't need the simulation web server.
-
Ken Krugler
committed
b8305d11:
Rename long-running tests to xxxLRTest, so we can key off that
-
Ken Krugler
committed
3e311c02:
Add step re getting rid of old distributions.
-
Ken Krugler
committed
65c15b86:
Clean up pom.xml
-
Ken Krugler
committed
6a62ebf0:
Update release info
-
Ken Krugler
committed
375c2f63:
We no longer depend on ICU4J, since we don't directly include the
-
Ken Krugler
committed
1ca2b20b:
Move the "no domain" test to integration tests, since
-
Ken Krugler
committed
b88fffdf:
Get rid of huge DMOZ file so our dist build is more reasonable.
-
Ken Krugler
committed
b7119feb:
Updated release procedure doc to match new reality.
-
Ken Krugler
committed
7774bc1d:
Set version to 1.0-SNAPSHOT in master
And 20 more commits...
New tag is at emi/bixo/tree/0.4.4
HEAD is 76e17b812e3d8110d14ecfcf34f0e9f8276af68b
-
Ken Krugler
committed
76e17b81:
Use clean when doing dist
HEAD is 49d3d85d9b4fe17a49730eac3144c5c65b65220d
-
Ken Krugler
committed
49d3d85d:
Added big gnarly test case for all incoming UrlDatum tuples getting
-
Ken Krugler
committed
9121559e:
Clean up code to use one collector for regular/error cases, since we don't
-
Ken Krugler
committed
fd8e6529:
Bump up max redirects and retries.
-
Ken Krugler
committed
ca886519:
Minor name cleanup
-
Ken Krugler
committed
f485ba1a:
Support specifying the max retry count.
-
Ken Krugler
committed
c655e366:
Add log4j.properties to src/main/resources so that it's possible to control
HEAD is 9eb0b5a02a21070d1dbdeaf153eaf2d488fb774a
-
Ken Krugler
committed
9eb0b5a0:
Updated tests to work.
-
Ken Krugler
committed
db381144:
Handle UrlStatus coming back (vs. raw string) from
-
Ken Krugler
committed
4a67c7e5:
Get rid of support for stuffing FetcherPolicy into properties - the
-
Ken Krugler
committed
7c25b5e2:
Add isAllowed(URL) method that never throws a checked exception.
-
Ken Krugler
committed
d0e13e08:
Use UrlStatus.FETCHED vs. funky string for valid fetches.
-
Ken Krugler
committed
d3943e65:
No longer return boolean from offer, as it can't ever be used
-
Ken Krugler
committed
1ca54346:
The only current reason for an aborted fetch exception is
-
Ken Krugler
committed
d139e918:
Added new SKIPPED_XXX statuses (stati?)
-
Ken Krugler
committed
c9262eef:
Added additional helper constructor.
-
Ken Krugler
committed
8825c9e0:
Use extra funky values for NO_XXX constants, so that using 0
-
Ken Krugler
committed
248e5c42:
Added info about & license for DMOZ data.
-
Ken Krugler
committed
b28aac06:
Moved test into integration area, as it depends on external (DNS) resources.
-
Ken Krugler
committed
60450e5b:
Move tool out of test.
HEAD is a0562398327b01734a28e5645e8b905f630aaf4a
-
Ken Krugler
committed
a0562398:
Support 1..n URLs in command line.
-
Ken Krugler
committed
5338f0a5:
Fixed handler used for redirection test.
-
Ken Krugler
committed
c05d6b9a:
Fixed generation of redirected URL
-
Ken Krugler
committed
08324a3c:
Improve output of toString()
-
Ken Krugler
committed
36b9ab42:
Get rid of release artifacts from master - these
-
Ken Krugler
committed
aaebdfb4:
Print extra info when we get a failure.
HEAD is 8b288cf0fcf7c01d65669330bbe7bea963494ade
-
Ken Krugler
committed
cb6e759e:
Comment out failing test for now.
-
Ken Krugler
committed
4a11b674:
Doesn't need the simulation web server.
-
Ken Krugler
committed
b8305d11:
Rename long-running tests to xxxLRTest, so we can key off that
-
Ken Krugler
committed
3e311c02:
Add step re getting rid of old distributions.
-
Ken Krugler
committed
65c15b86:
Clean up pom.xml
-
Ken Krugler
committed
6a62ebf0:
Update release info
-
Ken Krugler
committed
375c2f63:
We no longer depend on ICU4J, since we don't directly include the
-
Ken Krugler
committed
1ca2b20b:
Move the "no domain" test to integration tests, since
-
Ken Krugler
committed
b88fffdf:
Get rid of huge DMOZ file so our dist build is more reasonable.
-
Ken Krugler
committed
b7119feb:
Updated release procedure doc to match new reality.
-
Ken Krugler
committed
7774bc1d:
Set version to 1.0-SNAPSHOT in master
-
Ken Krugler
committed
5b9bf788:
Rolled in HTML test.
-
Ken Krugler
committed
61727e84:
Get rid of trailing space that was creating a test failure.
-
Ken Krugler
committed
787f187b:
Switch to SimpleParser (wrapper for Tika) from the Nutch HTML parser.
-
Ken Krugler
committed
81173cf7:
Hide HttpClient 3.1 wire logging, which shows up when
-
Ken Krugler
committed
7d713a6b:
Handle things like mailto: and javascript: in links, by
-
Ken Krugler
committed
1ed36a9c:
Get rid of now unused Nutch parsing code, since we're using Tika.
-
Ken Krugler
committed
381711ee:
Fixed up datum creation
-
Ken Krugler
committed
2d2d463d:
Fixed up datum creation
-
Ken Krugler
committed
8ed9a9a6:
Added safety checks for creating datum with null values.
And 1 more commit...
HEAD is c16d7b22361989abf30c0ce31d0ac6dad883b59e
-
Ken Krugler
committed
c16d7b22:
Committing 0.4.3 distribution build
-
Ken Krugler
committed
5b9bf788:
Rolled in HTML test.
-
Ken Krugler
committed
61727e84:
Get rid of trailing space that was creating a test failure.
-
Ken Krugler
committed
787f187b:
Switch to SimpleParser (wrapper for Tika) from the Nutch HTML parser.
-
Ken Krugler
committed
81173cf7:
Hide HttpClient 3.1 wire logging, which shows up when
-
Ken Krugler
committed
7d713a6b:
Handle things like mailto: and javascript: in links, by
-
Ken Krugler
committed
1ed36a9c:
Get rid of now unused Nutch parsing code, since we're using Tika.
-
Ken Krugler
committed
381711ee:
Fixed up datum creation
-
Ken Krugler
committed
2d2d463d:
Fixed up datum creation
-
Ken Krugler
committed
8ed9a9a6:
Added safety checks for creating datum with null values.
New tag is at emi/bixo/tree/0.4.3
New tag is at emi/bixo/tree/0.1
kkrugler
deleted branch cfetcher at kkrugler/bixo
Mon Sep 28 17:19:00 -0700 2009
Deleted branch was at kkrugler/bixo/tree/cfetcher
kkrugler
deleted branch tuplefun at kkrugler/bixo
Mon Sep 28 17:18:40 -0700 2009
Deleted branch was at kkrugler/bixo/tree/tuplefun
New branch is at emi/bixo/tree/release
HEAD is d92de1545e712ab5768e26df0d987feaa691503f
-
Ken Krugler
committed
d92de154:
Just to be clear, renamed HtmlParser to TikaHtmlParser, since
-
Ken Krugler
committed
a57fc806:
Rename HtmlParser to NutchHtmlParser.
-
Ken Krugler
committed
84a74aeb:
Minor tweak to allow relative URLs in <base> tag, even if
-
Ken Krugler
committed
74d4246d:
Switch to using our patched version of Tika, which has been
-
Ken Krugler
committed
896c4a40:
Add support for "install" ant target that installs resulting jar
