Description: Infinite Monkeywrench - A framework for collecting, peeling, and sharing delicious bananas of data.
Homepage: http://infinitemonkeywrench.org
Clone URL: git://github.com/infochimps/imw.git
imw / README-organization.txt
The goal is *brevity* and *speed of programming* rather than *efficiency of final
code*.
 
Where efficiency of final code becomes important,
 
---------------------------------------------------------------------------
 
 
Patterns in Munging Data
 
 
h3. TaskTracker
 
Process a series of objects idempotently:
* take each object in order of (given) priority;
* pass each object in turn to a Processor;
* log and timestamp the attempt;
* record a result code and extended description for its outcome.

No task is repeated (within an optional time window). Tasks can be inserted into the queue at any time and with any priority. Only rough order of priority is guaranteed.
 
ex.
*Scraping*: for each (depth-first search of a wildcarded URL path), (retrieve using wget into a URIFileStore); save (HTTP result code) and (wget's output)
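
A minimal sketch of the TaskTracker pattern, under assumptions: the class layout, the @add@/@run@ method names and the @[code, detail]@ return convention of the processor block are hypothetical, not IMW's actual API. It only illustrates priority ordering, skip-if-already-done, and the per-attempt log.

```ruby
# Hypothetical sketch of a TaskTracker; names are illustrative only.
class TaskTracker
  Attempt = Struct.new(:task, :at, :code, :detail)

  attr_reader :log

  # window: seconds within which a finished task is not repeated
  # (nil means "never repeat"). The block is the Processor and is
  # assumed to return [result_code, extended_description].
  def initialize(window: nil, &processor)
    @processor = processor
    @window = window
    @queue = []   # [priority, task] pairs
    @done = {}    # task => time of last attempt
    @log = []
  end

  # Tasks may be inserted at any time, with any priority.
  def add(task, priority: 0)
    @queue << [priority, task]
  end

  def run(now: Time.now)
    until @queue.empty?
      # Re-sort on every pass so late insertions are honored (rough order only).
      @queue.sort_by! { |priority, _| -priority }
      _, task = @queue.shift
      next if recently_done?(task, now)

      code, detail = begin
        @processor.call(task)
      rescue StandardError => e
        [:error, e.message]
      end
      @done[task] = now
      @log << Attempt.new(task, now, code, detail)
    end
    @log
  end

  private

  def recently_done?(task, now)
    last = @done[task]
    !last.nil? && (@window.nil? || now - last < @window)
  end
end
```

Adding the same task twice results in a single attempt, which is the idempotence the pattern asks for.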
 
h3. ChunkStore
 
Given:
* a collection of URIs
* a place & scheme to store them
* a fetcher to retrieve them
 
Lazily fetch documents:
*
* Store the retrieval timestamp
* Cache the document's metadata.
 
ex.
*URIFileStore*: save each URI to a file whose name safely but recognizably corresponds to the original URI.
*Document DB*: such as GraphDB or Lucene.
*Compressed*: URI refers to files within a compressed package. When a file is demanded, expand the package into scratch space.
*InfochimpsMetadataset*: lazily retrieve the dataset+schema for an infochimps metadataset, lazily produce its (compressed) contents.
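
The lazy-fetch idea could be sketched as below. Everything here is an assumption for illustration: the class shape, the @fetch@ method, the SHA1-based file naming (the URIFileStore naming scheme described later is more elaborate) and the fetcher passed as a block.

```ruby
require "digest/sha1"
require "tmpdir"

# Hypothetical sketch of a ChunkStore: fetch on first demand, then serve from disk.
class ChunkStore
  attr_reader :metadata

  # root: directory to store documents in; the block is the Fetcher.
  def initialize(root, &fetcher)
    @root = root
    @fetcher = fetcher
    @metadata = {}   # uri => { fetched_at:, bytes: }
  end

  # Placeholder naming scheme: one file per URI, named by its SHA1 digest.
  def path_for(uri)
    File.join(@root, Digest::SHA1.hexdigest(uri))
  end

  def fetch(uri)
    path = path_for(uri)
    unless File.exist?(path)
      body = @fetcher.call(uri)
      File.write(path, body)
      # Store the retrieval timestamp and cache some metadata.
      @metadata[uri] = { fetched_at: Time.now, bytes: body.bytesize }
    end
    File.read(path)
  end
end

Dir.mktmpdir do |dir|
  store = ChunkStore.new(dir) { |uri| "body of #{uri}" }
  store.fetch("http://example.com/a")   # calls the fetcher
  store.fetch("http://example.com/a")   # served from disk
end
```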
 
 
h3. Extract
 
* Given chunks of formatted data
** either in a common input format: xml, yaml, csv
** or parsed to a defined data structure
* return a stream of records.
 
 
ex.
  CSVExtractor.new()    # use field names from the first line to
                        # construct a Struct class on the fly;
                        # return each line as a struct
  Extract("foo.csv")    # same as CSVExtractor.new.extract("foo.csv")
  Extract("foo.yaml")   #
  Extract("foo.xml")    # return the "natural" (a la XML Simple) object tree
  HTMLExtractor         # (see below)
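
A sketch of the "Struct on the fly" idea for CSV, assuming Ruby's standard @csv@ library. Unlike the calls above, this hypothetical @extract@ takes the CSV text rather than a filename so that the example stays self-contained.

```ruby
require "csv"

# Hypothetical sketch: build a Struct class from the header row and
# return every following row as an instance of it.
class CSVExtractor
  def extract(text)
    rows = CSV.parse(text)
    header = rows.shift
    return [] if header.nil?

    record_class = Struct.new(*header.map { |name| name.strip.to_sym })
    rows.map { |row| record_class.new(*row) }
  end
end

records = CSVExtractor.new.extract(<<~CSV)
  name,location
  MarsPhoenix,"Mars, Solar System"
CSV
records.first.location   # the quoted field keeps its comma
```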
  
  
h4. HTML Extractor
 
* Map repeating HTML elements into data records
  Defining the document structure also defines the data structure
* Hpricot: http://code.whytheluckystiff.net/hpricot/wiki
* Selectors: http://docs.jquery.com/DOM/Traversing/Selectors
 
== Sample HTML (http://twitter.com):
 
  <ul class="about vcard entry-author">
    <li ><span class="label">Name</span> <span class="fn" >MarsPhoenix </span> </li>
    <li ><span class="label">Location</span> <span class="adr">Mars, Solar System</span> </li>
    <li id="bio"><span class="label">Bio</span> <span class="bio">I dig Mars! </span> </li>
    <li ><span class="label">Web</span>
       <a href="http://tinyurl.com/5wwaru" class="url" rel="me nofollow">http://tinyurl.co...</a></li>
  </ul>
 
== Parser Spec:
  :hcard => one('//ul.vcard.about',
    {
      :name => 'li/span.fn',
      :location => 'li/span.adr',
      :url => attr('li/a.url[@href]', 'href'),
      :bio => 'li#bio/span.bio',
    }
  )
 
== Example return:
  { :hcard => { :name => 'MarsPhoenix', :location => 'Mars, Solar System', :bio => 'I dig Mars!', :url => 'http://tinyurl.com/5wwaru' } }
  
== Sample HTML (http://delicious.com):
  <ul id="bookmarklist" class="bookmarks NOTHUMB">
    <li class="post" id="item-...">
      <div class="bookmark NOTHUMB">
        <div class="dateGroup"> <span title="23 APR 08">23 APR 08</span> </div>
        <div class="data">
          <h4> <a rel="nofollow" class="taggedlink" href="http://www.cs.biu.ac.il/~koppel/BlogCorpus.htm">Blog Authorship Corpus (Blogger.com 1994)</a>
                                        <a class="inlinesave" href="...">SAVE</a> </h4>
          <h5 class="savers-label"> PEOPLE</h5>
          <div class="savers savers2"> <a class="delNav" href="/url/7df6661946fca61863312644eb071953"><span class="delNavCount">26</span></a> </div>
          <div class="description"> The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person. </div>
        </div>
        <div class="meta"></div>
        <h5 class="tag-chain-label">TAGS</h5>
        <div class="tagdisplay">
          <ul class="tag-chain">
            <li class="tag-chain-item off first"><a class="tag-chain-item-link" rel="tag" href="/infochimps/blog" ><span class="tag-chain-item-span">blog</span> </a></li>
            <li class="tag-chain-item off"> <a class="tag-chain-item-link" rel="tag" href="/infochimps/corpus" ><span class="tag-chain-item-span">corpus</span> </a></li>
            <li class="tag-chain-item off"> <a class="tag-chain-item-link" rel="tag" href="/infochimps/analysis" ><span class="tag-chain-item-span">analysis</span></a></li>
            <li class="tag-chain-item off"> <a class="tag-chain-item-link" rel="tag" href="/infochimps/nlp" ><span class="tag-chain-item-span">nlp</span> </a></li>
            <li class="tag-chain-item on last"> <a class="tag-chain-item-link" rel="tag" href="/infochimps/dataset" ><span class="tag-chain-item-span">dataset</span> </a></li>
          </ul>
        </div>
        <div class="clr"></div>
      </div>
    </li>
  </ul>
 
== Parser Specification:
  :bookmarks => [ 'ul#bookmarklist/li.post/.bookmark',
    {
      :date => hash( '.dateGroup/span',
         [:day, :month, :year] => regexp( '', /(\d{2}) ([A-Z]{3}) (\d{2})/)
         ),
      :title => '.data/h4/a.taggedlink',
      :url => attr( '.data/h4/a.taggedlink', 'href'),
      :del_link_url => href( '.data/.savers/a.delNav'),
      :num_savers => to_i( '.data/.savers//span.delNavCount'),
      :description => '.data/.description',
      :tags => ['.tagdisplay//span.tag-chain-item-span']
    }
  ]
 
== Example output:
  { :bookmarks => [
    { :date => { :year => '08', :month => 'APR', :day => '23' },
      :title => 'Blog Authorship Corpus (Blogger.com 1994)',
      :url => 'http://www.cs.biu.ac.il/~koppel/BlogCorpus.htm',
      :del_link_url => '/url/7df6661946fca61863312644eb071953',
      :num_savers => 26,
      :description => 'The Blog ... ',
      :tags => ['blog', 'corpus', 'analysis', 'nlp', 'dataset'],
     }
   ]}
 
== Implementation:
 
Internally, we take the spec and turn it into a recursive structure of Matcher
objects. These consume Hpricot Elements and return the appropriately extracted
object.
 
Note that the /default/ is for a bare selector to match ONE element, and to not
complain if there are many.
 
Missing elements are silently ignored -- for example, given
  :foo => 'li.missing'
there will simply be no :foo element in the hash (as opposed to having hsh[:foo]
set to nil -- hsh.include?(:foo) will be false).
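
The recursive structure can be sketched without Hpricot. In the sketch below, @Node@ is a hypothetical stand-in for an Hpricot element (its @search@ is a plain key lookup, not a real selector engine) and @build_matcher@ is an assumed name for the spec-to-matcher step; only the String, Hash and Array forms of the spec are handled.

```ruby
# Stand-in for an Hpricot element: children are keyed by the literal path string.
Node = Struct.new(:text, :children) do
  def search(path)
    children.fetch(path, [])
  end
end

# First match only; does not complain if there are many.
class MatchFirst
  def initialize(path, inner = nil)
    @path = path
    @inner = inner
  end

  def match(node)
    el = node.search(@path).first
    return nil if el.nil?
    @inner ? @inner.match(el) : el.text
  end
end

# One result per matching element.
class MatchArray
  def initialize(path, inner = nil)
    @path = path
    @inner = inner
  end

  def match(node)
    node.search(@path).map { |el| @inner ? @inner.match(el) : el.text }
  end
end

# One key per field; fields whose matcher finds nothing are left out entirely.
class MatchHash
  def initialize(spec)
    @spec = spec
  end

  def match(node)
    @spec.each_with_object({}) do |(field, matcher), out|
      value = matcher.match(node)
      out[field] = value unless value.nil?
    end
  end
end

# Turn a spec into a tree of matchers.
def build_matcher(spec)
  case spec
  when String then MatchFirst.new(spec)
  when Hash   then MatchHash.new(spec.transform_values { |s| build_matcher(s) })
  when Array
    path, inner = spec
    MatchArray.new(path, inner && build_matcher(inner))
  else spec # assumed to be a matcher object already
  end
end
```

With such a tree, a field like @:foo => 'li.missing'@ produces no key at all, as described above.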
 
   
== List of Matchers:
    { :field => /spec/, ... }          # hash        hash, each field taken from spec.
    [ "hpricot_path" ]                 # 1-el array  array: for each element matching
                                       #             hpricot_path, the inner_html
    [ "hpricot_path", /spec/ ]         # 2-el array  array: for each element matching
                                       #             hpricot_path, pass to spec
    "hpricot_path"                     # string      same as one("hpricot_path")
    one("hpricot_path")                # one         first match to hpricot_path
    one("hpricot_path", /spec/)        # one         applies spec to first match to hpricot_path
    (these all match on one path:)
    regexp("hpricot_path", /RE/)       # regexp      capture groups from matching RE against
                                       #             inner_html of first match to hpricot_path
    attr("hpricot_path", 'attr_name')  # attr
    href("hpricot_path")               # href        shorthand for attr(foo, 'href')
    no_html                            #             strip tags from contents
    html_encoded                       #             html encode contents
    to_i, to_f, etc                    #             convert
    lambda{|doc| ... }                 # proc        calls proc on current doc
    
== Complicated HCard example:
    :hcards => [ '//ul.users/li.vcard',
      {
        :name => '.fn',
        :address => one('.adr',
          :street => '.street',
          :city => '.locality',
          :state => '.region',
          :zip => '.postal'
        ),
        :tel => [ 'span.tel',
          {
            :type => 'span.type',
            [:cc, :area, :num] => regexp('span.value', /\+(\d+).(\d{3})-(\d{3}-\d{4})/),
          }
        ],
        :tags => [ '.tag' ],
      }
    ]
 
== Resulting Parser
    MatchHash({:hcards => MatchArray('//ul.users/li.vcard',
      MatchHash({
        :name => MatchFirst('.fn'),
        :address => MatchFirst('.adr',
          MatchHash({
            :street => MatchFirst('.street'),
            :city => MatchFirst('.locality'),
            :state => MatchFirst('.region'),
            :zip => MatchFirst('.postal'),
          })),
        :tel => MatchArray('span.tel',
          MatchHash({
            :type => MatchFirst('span.type'),
            [:cc, :area, :num] => RegexpMatcher('span.value', /\+(\d+).(\d{3})-(\d{3}-\d{4})/),
          })
        ),
        :tags => MatchArray('.tag'),
      })
    )})
 
== Example output
    { :hcards => [
      {:tel => [ {:type => 'home', :cc => '49', :area => '305', :num => '555-1212'},
                 {:type => 'work', :cc => '49', :area => '305', :num => '555-6969'}, ],
       :name => "Bob Dobbs, Jr.",
       :tags => ["church"] },
      {:tel => [ {:type => 'fax', :cc => '49', :area => '305', :num => '867-5309'}, ],
       :name => "Jenny",
       :address => { :street => "53 Evergreen Terr.", :city => "Springfield" },
       :tags => ["bathroom", "wall"] },
    ]}
    
 
h3. Transform
 
* map fields across
* simple unit conversion
 
Field Mapper
 
  NaturalMapper.map(raw, out) # map fields in raw onto fields that exist in out
   #
 
  RegexpMapper( [:tel], [:cc, :tel],
    /\A\s*\+(\d+)?[\.\- ]?   # (optional) country code
      ( \d{3}[\.\- ]         # area code
        \d{3}[\.\- ]         # exchange
        \d{4} )\s*\Z/x,      # last four
    :warn => :tel)           # on regexp miss, warn and put contents into output
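
What the RegexpMapper call above is meant to do might look like the following sketch. The function name @regexp_map@ and its @on_miss:@ keyword (standing in for @:warn@) are assumptions, not IMW's API, and the regexp is a lightly adapted copy of the one above.

```ruby
# Split a raw telephone field into a country code and a national number.
TEL_RE = /\A\s*\+(\d+)?[.\- ]?   # (optional) country code
          ( \d{3}[.\- ]          # area code
            \d{3}[.\- ]          # exchange
            \d{4} )\s*\z/x       # last four

# Hypothetical sketch: map record[from] onto the fields in `to` using the
# capture groups of `re`. On a miss, warn and (optionally) pass the raw
# contents through under the `on_miss` key.
def regexp_map(record, from, to, re, on_miss: nil)
  text = record[from].to_s
  m = re.match(text)
  if m
    to.zip(m.captures).to_h
  else
    warn "no match for #{from.inspect}: #{text.inspect}"
    on_miss ? { on_miss => text } : {}
  end
end

regexp_map({ :tel => "+1-512-555-1212" }, :tel, [:cc, :tel], TEL_RE, on_miss: :tel)
# => { :cc => "1", :tel => "512-555-1212" }
```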
 
h3. Schema
 
* Terse description of
 
h4. Dump
 
* Export schematized, self-aware data as objects / tables into many formats
 
 
h3. Reconcile
 
*
 
h4. Detect Duplicates
 
h4. Repair faulty records, leaving original intact
 
 
h3. Reporter
 
Report progress, set application-level status messages and update Counters.
Tasks can use the Reporter to report progress or just indicate that they are alive.
Applications can also update Counters using the Reporter.
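
A minimal sketch of such a Reporter, with hypothetical method names (@progress@, @increment@, @alive?@); it only shows the three responsibilities named above.

```ruby
# Hypothetical sketch of a Reporter: status message, liveness heartbeat, counters.
class Reporter
  attr_reader :status, :counters, :last_progress_at

  def initialize
    @status = nil
    @counters = Hash.new(0)
    @last_progress_at = nil
  end

  # A bare heartbeat: "I am still alive".
  def progress
    @last_progress_at = Time.now
  end

  # Setting a status message also counts as progress.
  def status=(message)
    @status = message
    progress
  end

  def increment(counter, amount = 1)
    @counters[counter] += amount
    progress
  end

  # True if some progress was reported within the last `timeout` seconds.
  def alive?(timeout, now = Time.now)
    !@last_progress_at.nil? && now - @last_progress_at <= timeout
  end
end
```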
 
 
h3. File store
 
    appendage          _host_encoded(user)_encoded(password)
                       all but [a-zA-Z0-9] are encoded in user and password.
                       if scheme is HTTP and port, user, password are 80, nil and nil then the appendage is '';
                       otherwise all three parts are appended
    full_revhost       "#{revhost}#{appendage}"
    tld_tier           '_' + part of the encoded revhost up to the first '.'
    revhost_tier       '_' + first two characters (if any) of the sld (the remainder of full_revhost after the first '.')
    encode pathsegs    split path with %r{/+}
                       blank pathsegs are removed
                       all characters outside of [A-Za-z0-9_.-] are encoded
                       additionally, leading non-alpha characters are encoded.
    [tiered_pathsegs]  encoded pathsegs, with optional intermediate tiers (having a leading _) inserted if necessary.
    ext                part of the file name following and including its last '.'.
                       Note that the filename is left untouched; the ext is just re-appended for convenience's sake
    uuid               UUID.sha1_create(UUID_URL_NAMESPACE, url)
    filename           encode(file?query#fragment)+uuid+date.ext
                       since a '+' in a filename or path segment is encoded, a pathseg and filename will never collide.
 
  For the URL http://twitter.com/statuses/friends/bob.xml?since=Tue%2C+27+Mar+2007+22%3A55%3A48+GMT fetched at 4:20:00pm on 2008 Nov 4,
 
  URL parts:
      revhost, port, user, password, path, file, query, fragment
    [ com.twitter, nil, nil, nil, 'statuses/friends', bob.xml, since=Tue%2C+27+Mar+2007+22%3A55%3A48+GMT, nil ]
  File parts:
    [_tld_tier, _revhost_tier, revhost_scheme_port_user_password, [path, path, path].map(&:encode), encode(file?query#fragment), uuid(url), datetime, ext]
  Result:
    tld_tier _com
    revhost_tier _tw
    full_revhost com.twitter
    tiered_pathsegs [statuses, friends, _bo]
    filename bob.xml%3Fsince%3DTue%252C%2B27%2BMar%2B2007%2B22%253A55%253A48%2BGMT
    uuid 76eb6d0ad6fe5ae0b3128ec5e4a7a72f
    timestamp 20081104-162000
    ext .xml
 
  path:
    '_com/_tw/com.twitter/statuses/friends/_bo/bob.xml+3b78498d83a755e89b6e10cf7612ad8a+20081104-162000.xml'
    '_com/_tw/com.twitter/statuses/friends/_bo/bob.xml%3Fsince%3DTue%252C%2B27%2BMar%2B2007%2B22%253A55%253A48%2BGMT+76eb6d0ad6fe5ae0b3128ec5e4a7a72f+20081104-162000.xml'
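
  A rough sketch of the encoding side, under assumptions: @store_path@, @encode_seg@ and @pct@ are hypothetical names, the uuid is passed in rather than computed, the appendage (port, user, password) is ignored, and no intermediate tier such as _bo is inserted. An already percent-encoded leading character is treated as satisfying the "leading non-alpha" rule.

```ruby
require "uri"

# Percent-encode a single character, byte by byte.
def pct(ch)
  ch.bytes.map { |b| format("%%%02X", b) }.join
end

# Encode everything outside [A-Za-z0-9_.-] (so '%' and '+' are encoded too),
# then encode a leading non-alpha character.
def encode_seg(seg)
  out = seg.gsub(/[^A-Za-z0-9_.\-]/) { |ch| pct(ch) }
  out.sub(/\A[^A-Za-z%]/) { |ch| pct(ch) }
end

# Build the store path for a URL. uuid and fetched_at are supplied by the caller.
def store_path(url, uuid:, fetched_at:)
  uri = URI.parse(url)
  full_revhost = uri.host.split(".").reverse.join(".")
  tld, sld = full_revhost.split(".", 2)

  *dirs, file = uri.path.split(%r{/+}).reject(&:empty?)
  name = file.to_s
  name += "?" + uri.query if uri.query
  name += "#" + uri.fragment if uri.fragment

  stamp = fetched_at.strftime("%Y%m%d-%H%M%S")
  ext = File.extname(file.to_s)

  File.join("_#{tld}", "_#{sld.to_s[0, 2]}", full_revhost,
            *dirs.map { |d| encode_seg(d) },
            "#{encode_seg(name)}+#{uuid}+#{stamp}#{ext}")
end
```

  For the twitter URL above this gives the second path listed, minus the _bo tier.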
 
  decoding:
    split on '/'
    discard all tiers /^_.*/
    first part is full_revhost.
      extract port, user and password if there
      unreverse revhost
    decode remaining pathsegs
    split filename, uuid, timestamp, extension.
    decode filename
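
  The decoding steps could be sketched as follows. @decode_store_path@ is a hypothetical name; the sketch assumes the '+'-separated filename layout shown in the example paths and skips extraction of port, user and password.

```ruby
require "uri"

# Hypothetical sketch: recover the URL parts from a store path.
def decode_store_path(path)
  # Split on '/' and discard all tiers (segments starting with '_').
  parts = path.split("/").reject { |seg| seg.start_with?("_") }

  # First part is full_revhost; unreverse it.
  full_revhost, *segs = parts
  host = full_revhost.split(".").reverse.join(".")

  # Remaining segments are directories, then the stored filename.
  *dirs, last = segs
  stem, uuid, stamped = last.split("+", 3)

  {
    :host      => host,
    :dirs      => dirs.map { |d| URI.decode_www_form_component(d) },
    :file      => URI.decode_www_form_component(stem),
    :uuid      => uuid,
    :timestamp => File.basename(stamped, ".*"),
    :ext       => File.extname(stamped)
  }
end
```

  Because every literal '+' and '%' was encoded on the way in, a single decoding pass restores the original file?query string.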
 
  * you can find a uri with find -name "*+UUID_GOES_HERE+*"
  * reversing the host means that files from the same domain will appear together in
    file listings, InnoDB indexes, etc
  
  
  path segments that start with _ are culled (these are 'tiers' to handle huge
  collections of files otherwise in the same directory)
 
 
      m = (%r{\A
            (#{Addressable::URI::HOST_TLD}) # tld tier
           /(..?) # revhost tier
           /([^/\:_]+) # revhost
        (?:_([^/\:]+))? # _scheme
        (?::(\d*):([^/]*)@([^@/]*?))? # :port:user@password
           /(?:(.*?)/)? # /dirs/
            ([^/]*) # file
           -([a-f0-9]{32}) # -uuid
                                \z}x.match(fp))