Odette

Odette extracts as much information as possible from a word processor odt file and exports it as semantic XML (TEI).

Doc (in French): http://resultats.hypotheses.org/267

Demo: https://obvil.huma-num.fr/odette/

Odette may be used from the command line:

~/myrepos $ sudo apt install git php php-cli php-xml 
~/myrepos $ git clone https://github.com/oeuvres/odette.git
~/myrepos $ cd odette
~/myrepos/odette $ php odette.php

php odette.php (options)? "teidir/*.xml"
Export odt files with styles as XML (ex: TEI)

Parameters:
globs       : 1-n files or globs

Options:
-h, --help   : show this help message
-f, --force  : force deletion of destination file
-d destdir   : destination directory for generated files
-t template  : a specific template for export among:
               delacroix, desc_chine, dramabib, galien, hauy, hurlus, merveilles17, rougemont
--tei        : default, export odt as XML/TEI
--html       : export odt as html
--odtx       : export native odt xml (for debug)

Known styles

Odette transposes some word processor direct formatting at paragraph level (left, right, center) and character level (italic, small caps…), but most of the information is conveyed by user styles.

Word processor styles are either paragraph level (¶) or character level (@). For Odette to work well, make sure each style is defined at the right level in your word processor. Microsoft Office may create linked styles, for example a single style named Quote that applies both to a full paragraph and to a few quoted words inline; this can confuse an automatic converter. It is a good idea to design your style template in LibreOffice. You can save the template in docx format and edit texts with MS Word, but you must save the files as odt at the end to transform them with Odette.
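To check the level of your styles before running Odette, you can read them directly from the odt file. The sketch below is not part of Odette; it only assumes the standard OpenDocument layout, where an odt file is a zip archive whose styles.xml declares each named style with a style:family of "paragraph" or "text". The function name style_levels is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespace for style:style, style:name, style:family
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# ODF style families mapped to the notation used in this README
LEVELS = {"paragraph": "¶", "text": "@"}


def style_levels(odt):
    """Return {style name: sorted list of levels} for the named styles of an odt.

    `odt` is a path or a binary file-like object. A name listed with both
    levels is ambiguous and worth fixing in the word processor.
    """
    with zipfile.ZipFile(odt) as archive:
        root = ET.fromstring(archive.read("styles.xml"))
    found = {}
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        family = style.get(f"{{{STYLE_NS}}}family")
        if family not in LEVELS:
            continue  # skip table, graphic, and other families
        # display-name holds the label shown to the user, when it differs from name
        name = style.get(f"{{{STYLE_NS}}}display-name") or style.get(f"{{{STYLE_NS}}}name")
        found.setdefault(name, set()).add(LEVELS[family])
    return {name: sorted(levels) for name, levels in found.items()}
```

Calling style_levels("mytext.odt") on one of your own files (the file name is only an example) returns a dictionary such as {"Quote": ["¶"], "persName": ["@"]}, depending on the styles the file declares.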

As an example of what Odette does, if you use the paragraph style <ab>, the paragraph is transformed into the following XML:

<ab type="ornament">My para</ab>

Below is a list of the known normalized style names and their XML/TEI transposition. Unknown styles are kept in a @rend attribute. Style names are shown here normalized as ASCII lower-case letters, but real-life styles may contain capitals, accents, spaces, or punctuation. For example, quotesalute could appear to the user as <Quote, Salute> in the word processor (a style for the salutation of a letter inside a quotation).
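The normalization described above can be sketched as follows. This is an illustration, not Odette's actual code: the function name normalize_style is hypothetical, and keeping only the letters a to z (dropping digits too) is an assumption based on the description "ASCII lower-case letters".

```python
import unicodedata


def normalize_style(name):
    """Reduce a style name to ASCII lower-case letters (assumed rule, see above)."""
    # Decompose accented letters (é becomes e + combining accent) to keep the base letter
    decomposed = unicodedata.normalize("NFKD", name)
    # Lower-case, then drop everything that is not a letter from a to z
    return "".join(c for c in decomposed.lower() if "a" <= c <= "z")


print(normalize_style("Quote, Salute"))  # quotesalute
print(normalize_style("docAuthor"))      # docauthor
```

With such a rule, a template can use readable style names like <Quote, Salute> while still matching the normalized names listed below.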

Paragraph level styles

ab

<ab type="ornament">content ¶</ab>

address

<address>
  <addrLine>content ¶</addrLine>
</address>

argument

<argument>
  <p>content ¶</p>
</argument>

bibl

<bibl>content ¶</bibl>

byline

<byline>content ¶</byline>

camera

<camera>content ¶</camera>

caption

<caption>content ¶</caption>

castitem

<castList>
  <castItem>content ¶</castItem>
</castList>

castlist

<castList>content ¶</castList>

closer

<closer>content ¶</closer>

dateline

<dateline>content ¶</dateline>

def

<entryFree>
  <def>content ¶</def>
</entryFree>

desc

<desc>content ¶</desc>

docauthor

<docAuthor>content ¶</docAuthor>

docimprint

<docImprint>content ¶</docImprint>

docdate

<docDate>content ¶</docDate>

eg

<eg>content ¶</eg>

epigraph

<epigraph>
  <p rend="right italic…">content ¶</p>
</epigraph>

epigraphl

<epigraph>
  <l>content ¶</l>
</epigraph>

entry

<entry>content ¶</entry>

fw

<fw>content ¶</fw>

index

<index>
  <item>content ¶</item>
</index>

l

<l rend="center italic…">content ¶</l>

label

<label>content ¶</label>

labeldateline

<label type="dateline">content ¶</label>

labelhead

<label type="head">content ¶</label>

labelsalute

<label type="salute">content ¶</label>

labelspeaker

<label type="speaker">content ¶</label>

lg

<lg>
  <l>content ¶</l>
</lg>

opener

<opener>content ¶</opener>

p

<p rend="right italic…">content ¶</p>

pb

<pb n=""/>

postscript

<postscript>
  <p>content ¶</p>
</postscript>

q

<q>content ¶</q>

quote

<quote>
  <p rend="right italic…">content ¶</p>
</quote>

quotedateline

<quote>
  <dateline>content ¶</dateline>
</quote>

quotel

<quote>
  <l>content ¶</l>
</quote>

quotesalute

<quote>
  <salute>content ¶</salute>
</quote>

quotesigned

<quote>
  <signed>content ¶</signed>
</quote>

role

<castItem>
  <role>content ¶</role>
</castItem>

roledesc

<castItem>
  <roleDesc>content ¶</roleDesc>
</castItem>

said

<said>content ¶</said>

salute

<salute>content ¶</salute>

salutation

<salute>content ¶</salute>

set

<set>
  <p>content ¶</p>
</set>

signed

<signed>content ¶</signed>

speaker

<speaker>content ¶</speaker>

stage

<stage>content ¶</stage>

term

<index>
  <term>content ¶</term>
</index>

trailer

<trailer>content ¶</trailer>

Character level styles

abbr

blah… <abbr>@ level</abbr> …blah

add

blah… <add>@ level</add> …blah

actor

blah… <actor>@ level</actor> …blah

author

blah… <author>@ level</author> …blah

affiliation

blah… <affiliation>@ level</affiliation> …blah

age

blah… <age>@ level</age> …blah

bibl

blah… <bibl>@ level</bibl> …blah

c

blah… <c>@ level</c> …blah

code

blah… <code>@ level</code> …blah

corr

blah… <corr>@ level</corr> …blah

date

blah… <date>@ level</date> …blah

del

blah… <del>@ level</del> …blah

distinct

blah… <distinct>@ level</distinct> …blah

email

blah… <email>@ level</email> …blah

emph

blah… <emph>@ level</emph> …blah

geogname

blah… <geogName>@ level</geogName> …blah

gloss

blah… <gloss>@ level</gloss> …blah

name

blah… <name>@ level</name> …blah

num

blah… <num>@ level</num> …blah

pb

blah… <pb>@ level</pb> …blah

persname

blah… <persName>@ level</persName> …blah

placename

blah… <placeName>@ level</placeName> …blah

stage

blah… <stage>@ level</stage> …blah

title

blah… <title>@ level</title> …blah