kig / metadata

File metadata extraction tool and Ruby library

This URL has Read+Write access

metadata / README
100755 403 lines (323 sloc) 12.21 kb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
Thanks
------
 
  Konrad Meyer for his patient testing and bug reports.
  Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities
  (along with being the author of wmainfo-rb and flacinfo-rb.)
 
 
Description
-----------
 
  This package `Metadata' comes with a library called `metadata' and
  a small program called `mdh'.
 
  The library probes files for their metadata (e.g. jpeg dimensions
  and camera make, mp3 artist, pdf text and word count) and returns the
  metadata as a Hash. All strings in the metadata are converted to UTF-8.
 
  The `mdh'-program can print out file metadata as YAML and package the
  metadata with the file.
 
  The metadata hash follows the shared file metadata spec naming, with some
  additional fields, see list at the end of this file (Appendix A.)
 
  For details on the MDH file format, see the end of this file (Appendix B.)
 
 
Usage
-----
 
  # print out metadata for myfile.jpg
  mdh myfile.jpg
 
  # create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
  mdh -c myfile.jpg
 
  # print out the metadata header from an MDH file
  mdh -e -p myfile.jpg.mdh
 
  # strip out the metadata header from an MDH file and write the actual file
  # to myfile.jpg
  mdh -e myfile.jpg.mdh
 
  # include file path, filename, md5sum and sha1sum in the metadata header
  mdh --path --name -m -s myfile.jpg
 
  # guess title for document (first line that starts with a capital letter)
  mdh --guess-title foo.ps
 
  # guess title, abstract and metadata for document
  mdh --guess-metadata foo.ps
 
  # don't include document text (File.Content) in the metadata
  mdh --no-text foo.ps
 
  # query CiteSeer with the document title, add possible results to metadata
  mdh --citeseer foo.ps
 
  # query DBLP with the document title, add possible results to metadata
  mdh --dblp foo.ps
 
  # If you have an unknown CS document, this might help identify it:
  mdh --guess-metadata --dblp --citeseer foo.ps
 
  # print out the list of options
  mdh -h
 
  irb> require 'metadata'
  irb> Metadata.extract('myfile.jpg')
  irb> Metadata.extract_text('myfile.pdf')
  irb> Pathname.new("myfile.jpg").metadata
 
 
List of supported formats
-------------------------
 
  Audio:
    Whatever you manage to make mplayer play.
    Plus special handlers for FLAC, m4a, ape, musepack, wavepack and wma.
 
    Successfully tested with:
      mp3, flac, ogg, wav, ra, m4a, wma
 
    Should also work:
      wv, mpc, ape
 
 
  Video:
    Whatever you manage to make mplayer play.
 
    Successfully tested with:
      wmv, mov, divx, xvid, flv, ogg, mpg, mkv
 
 
  Images:
    Should handle pretty much anything.
    I.e. anything handled by ExifTool, ImageMagick, Imlib2 or dcraw.
 
    Successfully tested with:
      Web formats:
        jpeg, png, gif, svg
      Camera raws:
        nef, dng, crw, pef, orf, arw, raf, cr2
      Image editor state dumps:
        psd, xcf
      The rest:
        tga, tif, bmp, xpm, ppm, pcx
 
 
  Documents:
    Successfully tested with:
      Web formats:
        html, txt
      Print formats:
        pdf, ps, ps.gz
      OO formats:
        sxi, odp
      MS formats:
        doc, ppt, xls
 
    - I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
      dimensions extraction, so those bits of data are missing. MSOffice docs
      are missing dimensions for the same reason. Here's a way to get them:
      ( first, get Thumbnailer: http://github.com/kig/thumbnailer/tree/master )
      $ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
      $ mdh foo.odp
      $ rm foo.odp-temp.pdf /tmp/foo.jpg
 
 
  Others:
    - BitTorrent .torrent files
    - Archive contents: tar.gz, zip
    - Whatever `extract' outputs and I am handling
 
 
  Formats that yield very little metadata:
    ai
 
  Formats that don't yield usable metadata:
    chm, sis, rb, rar, ttf
 
  Formats that fail mimetype guessing:
    exr
 
 
 
Requirements
------------
 
  * Ruby 1.8
 
  * Tons of metadata extraction programs and libs.
    This package has many dependencies since there is no single universal
    metadata header format that all files use. Blame resource forks, filename
    extensions, bags of bytes and mimetypes.
 
    List of gems:
      flacinfo-rb
      wmainfo-rb
      MP4Info
      id3lib-ruby
      apetag
      text
      hpricot
      ruby-mp3info
 
    List of Debian packages:
      dcraw
      libimlib2-ruby
      extract
      libimage-exiftool-perl
      poppler-utils
      mplayer
      html2text
      imagemagick
      unhtml
      pstotext
      antiword
      catdoc
      shared-mime-info
 
  * You do want to install the latest versions of dcraw and
    shared-mime-info to be able to handle camera raw images.
    http://cybercom.net/~dcoffin/dcraw/
    http://freedesktop.org/wiki/Software/shared-mime-info
 
  * Python + chardet library
    http://chardet.feedparser.org/
 
 
Install
-------
 
  De-compress archive and enter its top directory.
  Then type:
 
   ($ su)
    # ruby setup.rb
 
  These simple step installs this program under the default
  location of Ruby libraries. You can also install files into
  your favorite directory by supplying setup.rb some options.
  Try "ruby setup.rb --help".
 
 
Appendix A: Metadata fields
--------------------------------------
 
  This list contains the metadata fields output by Metadata and mdh.
  The list follows the shared file metadata spec for the most part.
  http://wiki.freedesktop.org/wiki/Specifications/shared-filemetadata-spec
 
  field name | field type
  ----------------------------------------------------------------------
  Archive.Contents array of pathnames
 
  Audio.Band string
  Audio.Composer string
  Audio.Conductor string
  Audio.Copyright string (copyright message)
  Audio.Grouping string
  Audio.Image base64-encoded binary string (embedded image data)
  Audio.InterpretedBy string
  Audio.Lyricist string
  Audio.Publisher string
  Audio.RemixedBy string
  Audio.Subtitle string
  Audio.Tempo integer
  Audio.VariableBitrate boolean
  Audio.Writer string
  Audio.Publicationright string
  Audio.File string
  Audio.EAN/UPC string
  Audio.ISBN string
  Audio.Catalog string
  Audio.LC string
  Audio.Media string
  Audio.Index string
  Audio.Related string
  Audio.ISRC string
  Audio.Abstract string
  Audio.Language string
  Audio.Bibliography string
  Audio.Introplay string
  Audio.Dummy string
  Audio.DebutAlbum string
  Audio.RecordDate string
  Audio.RecordLocation string
  v-- ORIGINAL FIELDS USED --v
  Audio.Title string
  Audio.Artist string
  Audio.Album string
  Audio.AlbumArtist string
  Audio.AlbumTrackCount integer
  Audio.TrackNo integer
  Audio.DiscNo integer
  Audio.Performer string
  Audio.Duration float
  Audio.ReleaseDate datetime
  Audio.Comment string
  Audio.Genre string
  Audio.Codec string
  Audio.Samplerate integer
  Audio.Bitrate float
  Audio.Channels integer
  Audio.Lyrics string
 
  Doc.Album string
  Doc.Artist string
  Doc.Charset string
  Doc.Description string
  Doc.Genre string
  Doc.Language string
  Doc.ModifyDate date
  Doc.PageSizeName string (A4, A5, letter, ...)
  Doc.RevisionHistory array of strings
  Doc.ParagraphCount integer
  Doc.LineCount integer
  Doc.CharacterCount integer
  Doc.LastSavedBy string
  Doc.Keywords array of strings
  Doc.Template string
  Doc.Publisher string
  Doc.PublicationName string
  Doc.PublicationPages string
  Doc.Citations array of {href=>a, title=>b, rest=>c} hashes
  Doc.Contributor string
  Doc.CiteSeerIdentifier string
  Doc.CiteSeerURL string
  Doc.Published datetime
  Doc.Source string
  Doc.DBLPIdentifier string
  Doc.CrossRef string (BibTex crossref)
  Doc.BibSource string (BibTex source)
  Doc.BibTexType string (BibTex type: article, inbook, ...)
  Doc.ACMCategories array of strings
  v-- ORIGINAL FIELDS USED --v
  Doc.Title string
  Doc.Subject string
  Doc.Author string
  Doc.PageCount integer
  Doc.WordCount integer
  Doc.Created datetime
 
  File.Software string (software used to create the file)
  File.MD5Sum string (md5sum of file's contents)
  File.SHA1Sum string (sha1sum of file's contents)
  v-- ORIGINAL FIELDS USED --v
  File.Name string (basename of the file)
  File.Path string (dirname of the file)
  File.Format string (mime type, inode/directory for dirs)
  File.Size integer
  File.Content string
  File.Modified string
 
  Image.DateCreated date
  Image.DateTimeCreated date
  Image.DateTimeOriginal date
  Image.DimensionUnit string (px, mm, pt, ...)
  Image.Editor string
  Image.EXIF string (exiftool output)
  Image.FrameCount integer
  Image.LayerCount integer
  Image.Modified date
  Image.OriginatingProgram string
  Image.ComponentCount integer
  Image.ColorMode string (e.g. RGB)
  Image.ColorSpace string (e.g. sRGB)
  v-- ORIGINAL FIELDS USED --v
  Image.Height float
  Image.Width float
  Image.Title string
  Image.Date datetime
  Image.Creator string
  Image.Description string
  Image.Software string
  Image.CameraMake string
  Image.CameraModel string
  Image.ExposureProgram string
  Image.ExposureTime float
  Image.Fnumber float
  Image.Flash boolean
  Image.FocalLength float
  Image.ISOSpeed float
  Image.MeteringMode string
  Image.WhiteBalance string
  Image.Copyright string
 
  Location.Latitude float
  Location.Longitude float
 
  Video.Album string
  Video.Artist string
  Video.Bitrate integer
  Video.Codec string
  Video.Comment string
  Video.Duration float
  Video.Framerate float (frames per second)
  Video.Genre string
  Video.ReleaseDate date
  Video.Title string
  Video.TrackNo integer
  Video.Demuxer string
 
  BitTorrent.Name string
  BitTorrent.Files array of { 'path' => string,
                                       'length' => integer,
                                       'md5sum' => string }
  BitTorrent.Length integer (size of single-file torrents)
  BitTorrent.MD5Sum string (md5sum for single-file torrents)
  BitTorrent.PieceCount integer
  BitTorrent.PieceLength integer (length of a single piece
  BitTorrent.Comment string
  BitTorrent.Announce string (announce url)
  BitTorrent.AnnounceList array of arrays of strings
  BitTorrent.Nodes array of [hostname, port] -arrays
 
 
 
Appendix B: The MDH file format
-------------------------------
 
  MDH files are built as follows:
 
  bytes | content
  ---------------
    3 | "MDH" - MDH file format identifier
    1 | "\x01" - MDH file format version number
    4 | Long, network byte order - the size of the metadata struct in bytes
   var | YAML - The MDH metadata struct
   var | The actual file contents
 
  All string fields in the metadata are UTF-8.
 
 
License
-------
 
  Ruby's
 
 
Ilmari Heikkinen <ilmari.heikkinen gmail com>