Add textbook-html Relax-NG Grammar #9

philschatz · 2017-05-10T21:23:19Z

Click me to View the Docs

The goal is to validate against the raw (and then baked) books in https://github.com/Connexions/cnx-rulesets

TODO

First, widen the grammar to stop erroring:

add XHTML5 elements that openstax uses
- the latest grammar that I found only supports XHTML 1.0
- examples: <figure>, <section>
add itemprop attributes to the elements that need them
add data-type and data-* attributes to the elements that need them
add MathML module
- this adds the unofficial "mathml2" module that is included in the cnxml RNG

Example grammar for the book structure as a .rnc file (more readable):

default namespace = "http://www.w3.org/1999/xhtml"

include "_content.rnc"
# <head itemscope="itemscope" itemtype="http://schema.org/Book">
Structure.Book.Head.attlist =
  attribute itemscope { "itemscope" },
  attribute itemtype { "http://schema.org/Book" }
# from xhtml/modules/struct.rng but changed to only include a subset of valid "root" elements
Structure.Book.Body =
  element body {
    # <body itemscope="itemscope" itemtype="http://schema.org/Book">
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Book" },
    body.attlist,
    # Actually, it's BookMetadata
    Structure.Book.Metadata,
    Structure.Book.ToC,
    (Structure.Chapter
     | # Preface or Appendix
       Structure.Page)+
  }
Structure.Page =
  element div {
    attribute data-type { "page" },
    class.attrib,
    id.attrib.required,
    Structure.PageMetadata,
    Structure.Page.Title,
    Structure.Page.Abstract?,
    Flow.model*,
    Content.Glossary?,
    Content.FootnoteRefs?
  }
Structure.Chapter =
  element div {
    attribute data-type { "chapter" },
    Structure.ChapterMetadata,
    element h1 {
      attribute data-type { "document-title" },
      Inline.model
    },
    Structure.Page+
  }
# <div data-type="document-title" id="auto_7a6f73c5-5378-408b-986f-e54416e12ad0_72544">Preface</div>
Structure.Page.Title =
  element div {
    attribute data-type { "document-title" },
    # id is not required for individual CNXML->HTML pages but *is* required for book conversion
    id.attrib,
    Inline.model
  }
# <div data-type="abstract" id="auto_7a6f73c5-5378-408b-986f-e54416e12ad0_78256">
Structure.Page.Abstract =
  element div {
    attribute data-type { "abstract" },
    # id is not required for individual CNXML->HTML pages but *is* required for book conversion
    id.attrib,
    Flow.model
  }
Structure.PageMetadata =
  element div {
    attribute data-type { "metadata" },
    Structure.Metadata.Title,
    Structure.Metadata.Uri,
    Structure.Metadata.ShortId,
    Structure.Metadata.Authors,
    Structure.Metadata.Publishers,
    Structure.Metadata.Permissions,
    Structure.Metadata.Description,
    Structure.Metadata.Keywords,
    Structure.Metadata.Subjects?,
    Structure.Metadata.Resources?
  }
Structure.Metadata.Title =
  element h1 {
    attribute data-type { "document-title" },
    attribute itemprop { "name" },
    Inline.model
  }
Structure.Metadata.Uri =
  element span {
    attribute data-type { "cnx-archive-uri" },
    attribute data-value { UUID-and-version.datatype }
  }
Structure.Metadata.ShortId =
  element span {
    attribute data-type { "cnx-archive-shortid" },
    attribute data-value { ShortId.datatype }
  }
Structure.Metadata.Authors =
  element div {
    attribute class { "authors" },
    (# Allow text like "Edited by:" intermingled in the DOM
     text?,
     Structure.Metadata.Authors.Item)+,
    # Allow text like "Edited by:" intermingled in the DOM
    text?
  }
Structure.Metadata.Authors.Item =
  element span {
    attribute data-type { "author" },
    id.attrib.required,
    attribute itemprop { "author" },
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Person" },
    element a {
      attribute data-type { "cnx-id" },
      attribute href { Text.datatype },
      attribute itemprop { "url" },
      Text.datatype
    }
  }
Structure.Metadata.Publishers =
  element div {
    attribute class { "publishers" },
    (# Allow text like "Edited by:" intermingled in the DOM
     text?,
     Structure.Metadata.Publishers.Item)+,
    # Allow text like "Edited by:" intermingled in the DOM
    text?
  }
# Copy/paste from Structure.Metadata.Authors.Item
# TODO: Combine these somehow.
Structure.Metadata.Publishers.Item =
  element span {
    attribute data-type { "publisher" },
    id.attrib.required,
    attribute itemprop { "publisher" },
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Person" },
    element a {
      attribute data-type { "cnx-id" },
      attribute href { Text.datatype },
      attribute itemprop { "url" },
      Text.datatype
    }
  }
Structure.Metadata.Permissions =
  # change class="permissions" to some other attribute
  element div {
    attribute class { "permissions" },
    Structure.Metadata.Copyrights?,
    Structure.Metadata.License
  }
Structure.Metadata.License =
  element p {
    attribute class { "license" },
    text,
    # <a data-type="license" itemprop="...">
    element a {
      attribute data-type { "license" },
      attribute href { URI.datatype },
      attribute itemprop { "dc:license,lrmi:useRightsURL" },
      text
    }
  }
Structure.Metadata.Copyrights =
  element p {
    attribute class { "copyright" },
    # Copyright:
    text,
    # <span data-type="copyright-holder" id="copyright-holder-1" itemprop="copyright-holder" itemscope="itemscope" itemtype="http://schema.org/Person">
    #   <a data-type="cnx-id" href="cnxecon" itemprop="url">OpenStax Economics</a>
    # </span>
    Structure.Metadata.Copyrights.Item+
  }
Structure.Metadata.Copyrights.Item =
  element span {
    attribute data-type { "copyright-holder" },
    attribute itemprop { "copyright-holder" },
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Person" },
    element a {
      attribute data-type { "cnx-id" },
      # TODO: This attribute is invalid. the element should not be a link
      attribute href { UserLogin.datatype },
      attribute itemprop { "url" },
      UserName.datatype
    }
  }
# <span data-type="cnx-archive-uri" data-value="69619d2b-68f0-44b0-b074-a9b2bf90b2c6@11.332"/>
# <span data-type="cnx-archive-shortid" data-value="aWGdK2jw@11.332"/>
Structure.Book.Metadata =
  element div {
    attribute data-type { "metadata" },
    Structure.Metadata.Title,
    Structure.Metadata.Uri,
    Structure.Metadata.ShortId,
    Structure.Metadata.Authors,
    Structure.Metadata.Publishers,
    Structure.Metadata.PrintStyle,
    Structure.Metadata.TranslucentBinding?,
    Structure.Metadata.Permissions,
    Structure.Metadata.Description,
    Structure.Metadata.Subjects
  }
# <div class="print-style">
# Print style:
# <span data-type="print-style">ccap-economics</span>
# </div>
Structure.Metadata.PrintStyle =
  element div {
    attribute class { "print-style" },
    # "Print style:"
    text,
    element span {
      attribute data-type { "print-style" },
      TODO.enum.datatype
    }
  }
Structure.ChapterMetadata =
  element div {
    attribute data-type { "metadata" },
    Structure.Metadata.Title,
    Structure.Metadata.TranslucentBinding,
    Structure.Metadata.Permissions
  }
Structure.Metadata.TranslucentBinding =
  # TODO: remove me. why is this here?
  element span {
    attribute data-type { "binding" },
    attribute data-value { "translucent" }
  }
Structure.Metadata.Description =
  element div {
    attribute class { "description" },
    attribute data-type { "description" },
    attribute itemprop { "description" },
    Flow.model+
  }
Structure.Metadata.Keywords =
  # TODO: This should probably be wrapped in an element
  element div {
    attribute data-type { "keyword" },
    attribute itemprop { "keywords" },
    text
  }*
# <div data-type="subject" itemprop="about">Mathematics and Statistics</div>
Structure.Metadata.Subjects =
  element div {
    attribute data-type { "subject" },
    attribute itemprop { "about" },
    Subject.datatype
  }*
# <div data-type="resources" style="display: none"> <ul>
# <li><a href="971b6320e705d9c81cbad7f4a98148ab91456d3b">971b6320e705d9c81cbad7f4a98148ab91456d3b</a></li>        </ul>
Structure.Metadata.Resources =
  element div {
    attribute data-type { "resources" },
    # TODO: remove this attribute, maybe the whole element. What is it used for?
    attribute style { "display: none" },
    element ul { Structure.Metadata.Resources.Item+ }?
  }
Structure.Metadata.Resources.Item =
  element li {
    element a {
      attribute href { Sha.datatype },
      Sha.datatype
    }
  }
# The <nav id="toc">
Structure.Book.ToC =
  element nav {
    attribute id { "toc" },
    element ol {
      Structure.Book.ToC.LeafItem*,
      Structure.Book.ToC.InternalItem+,
      Structure.Book.ToC.LeafItem*
    }
  }
Structure.Book.ToC.LeafItem =
  element li {
    attribute cnx-archive-shortid { ShortId.datatype },
    attribute cnx-archive-uri { UUID-and-version.datatype },
    element a {
      attribute href { URI.datatype },
      text
    }
  }
Structure.Book.ToC.InternalItem =
  element li {
    element span { text },
    element ol {
      (Structure.Book.ToC.InternalItem | Structure.Book.ToC.LeafItem)+
    }
  }
# <meta content="en" data-type="language" itemprop="inLanguage"/>
# <meta content="MathML" itemprop="accessibilityFeature"/>
# <meta content="LaTeX" itemprop="accessibilityFeature"/>
# <meta content="alternativeText" itemprop="accessibilityFeature"/>
# <meta content="captions" itemprop="accessibilityFeature"/>
# <meta content="structuredNavigation" itemprop="accessibilityFeature"/>
# <meta content="2014-01-02T22:13:54Z" itemprop="dateCreated"/>
# <meta content="2016-11-04T17:17:06Z" itemprop="dateModified"/>
enum.attr.meta.itemprop =
  "inLanguage" | "accessibilityFeature" | "dateCreated" | "dateModified"
enum.attr.meta.data-type = "language"
meta.attlist &=
  # <meta content="en" data-type="language" itemprop="inLanguage"/>
  (attribute itemprop { "inLanguage" },
   attribute data-type {
     # TODO: Remove this attribute
     "language"
   })
  | # <meta content="MathML" itemprop="accessibilityFeature"/>
    # <meta content="LaTeX" itemprop="accessibilityFeature"/>
    # <meta content="alternativeText" itemprop="accessibilityFeature"/>
    # <meta content="captions" itemprop="accessibilityFeature"/>
    # <meta content="structuredNavigation" itemprop="accessibilityFeature"/>
    # <meta content="2014-01-02T22:13:54Z" itemprop="dateCreated"/>
    # <meta content="2016-11-04T17:17:06Z" itemprop="dateModified"/>
    attribute itemprop {
      "accessibilityFeature" | "dateCreated" | "dateModified"
    }
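
The `UUID-and-version.datatype` and `ShortId.datatype` patterns referenced in the grammar are defined outside this excerpt. As a rough illustration only, the Python sketch below accepts the two example values quoted in the grammar comments. The regular expressions and function names are assumptions made for this example, not the PR's actual datatype definitions.

```python
import re

# Assumed shapes, inferred only from the two example values in the grammar
# comments; the real datatypes live elsewhere and may be stricter or looser.
UUID_AND_VERSION = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}@\d+(\.\d+)?"
)
SHORT_ID = re.compile(r"[A-Za-z0-9_-]{8}@\d+(\.\d+)?")


def looks_like_uuid_and_version(value):
    """True if value has the form <uuid>@<major>[.<minor>]."""
    return UUID_AND_VERSION.fullmatch(value) is not None


def looks_like_short_id(value):
    """True if value has the form <8 url-safe chars>@<major>[.<minor>]."""
    return SHORT_ID.fullmatch(value) is not None
```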

And then, fix HTML elements that are not allowed in a certain context (in the CNXML->HTML conversion)

And then, clean up the metadata blocks:

duplicate id="publisher-1" attributes
<div class="..."> elements that have no data-type (or other distinguishing attribute), so they cannot be validated
invalid id="{UUID}" attributes (an XML id must start with a letter or underscore, not a digit)
...
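
The two id problems in the list above can be spotted with a small check before running the full grammar. This is a minimal sketch, assuming the id values have already been collected into a list; the helper names are hypothetical and the pattern is an ASCII-only approximation of the XML rule.

```python
import re
from collections import Counter

# ASCII-only approximation of the XML rule for ID values (an NCName):
# the first character is a letter or underscore; later characters may also
# be digits, '.', or '-'. Full XML allows many non-ASCII letters as well.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def invalid_ids(ids):
    """Return the ids that do not match the simplified NCName pattern."""
    return [value for value in ids if ID_PATTERN.fullmatch(value) is None]


def duplicate_ids(ids):
    """Return, in sorted order, every id that occurs more than once."""
    counts = Counter(ids)
    return sorted(value for value, n in counts.items() if n > 1)
```

A bare UUID such as the econ book's fails the first check because it begins with a digit, while the `auto_`-prefixed ids shown in the grammar comments pass.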

And then, begin restricting the elements and attributes:

And then:

clean up the TODO comments
organize the RNG files
convert XML comments into <a:documentation> elements

codecov · 2017-05-10T21:23:21Z

Codecov Report

Merging #9 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff          @@
##           master     #9   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           5      5           
  Lines          74     74           
=====================================
  Hits           74     74

java -jar ./cnxml/jing.jar ./textbook-html/textbook-html.rng /path/to/cnx-rulesets/data/econ-raw.xhtml | grep "error: attribute"

java -jar ./cnxml/jing.jar ./textbook-html/textbook-html.rng /path/to/cnx-rulesets/data/econ-raw.xhtml | grep "unknown element"
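
The two jing invocations above filter one class of error at a time. A rough way to see all classes at once is to tally the output by category, as in the sketch below. It assumes each jing message has the shape `<file>:<line>:<col>: error: <message>`; the category names are the three phrases used in this PR's commit messages, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed shape of one jing message: "<file>:<line>:<col>: error: <message>".
ERROR_LINE = re.compile(r"^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$")


def categorize(message):
    """Bucket a message using the phrases grepped for in this PR."""
    if message.startswith("attribute"):
        return "attribute"
    if "unknown element" in message:
        return "unknown element"
    if "required attributes missing" in message:
        return "required attributes missing"
    return "other"


def tally(lines):
    """Count error lines per category; lines that do not look like errors are skipped."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[categorize(match.group("msg"))] += 1
    return counts
```

To use it, save the jing output to a file and pass the open file object to `tally`.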

philschatz · 2017-05-11T00:11:14Z

yay, down to 1828 errors (in the econ book) from like 50k or so. I think the remaining errors require changes to both:

cnxml->HTML conversion (e.g. <table> inside a <p>)
cnx-epub or whichever repo adds the metadata blocks for each PageModule
- invalid id attributes for data-type="page"
- duplicate ids for publisher-1 (I think they do not need an id attribute)

@reedstrm where would I look to change the metadata blocks?

reedstrm · 2017-05-11T16:23:44Z

re: <div data-type="metadata">: Hmm, that is the export_epub code, which uses the formatters in https://github.com/Connexions/cnx-epub/blob/master/cnxepub/formatters.py

reedstrm · 2017-05-15T16:24:49Z

To get an epub to validate, try `cnx-archive-export_epub --format baked my_ident_hash_here //etc/cnx/archive/app.ini my.epub` on tea.cnx.org

philschatz · 2017-05-15T19:07:09Z

@reedstrm Hmm, I tried and got the following:

cnx-archive-export_epub --format raw 69619d2b-68f0-44b0-b074-a9b2bf90b2c6 //etc/cnx/archive/app.ini my.epub
cnx-archive-export_epub: command not found

I also tried running the following but nothing showed up:

find / -name "cnx-archive-export_epub"

Where would I find that script on tea.cnx?

reedstrm · 2017-05-15T21:17:55Z

@philschatz check your scripts dir in the archive venv: /var/cnx/venv/archive/bin, I think.

philschatz · 2017-05-16T16:26:00Z

Thanks! That worked. I checked the econ book (69619d2b-68f0-44b0-b074-a9b2bf90b2c6) and the only things that errored were the <span> tags in the ToC (the grammar rejects a <span> that has no attributes).

Of course, I'm only on the 1st part (widen the scope of validation so that books pass through). The 2nd part is "narrow the scope of validation so that books are structurally sound".

For example, right now an "exercise solution" can exist anywhere but it should only occur inside an exercise.
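
In the grammar itself this restriction would come from referencing the solution pattern only inside the exercise pattern. Until then, the same rule can be checked after the fact. The sketch below is one way to do that with the standard library, assuming solutions and exercises are marked with `data-type="solution"` and `data-type="exercise"`; those attribute values and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def misplaced(root, child_type="solution", parent_type="exercise"):
    """Return elements of child_type that have no ancestor of parent_type."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    found = []
    for element in root.iter():
        if element.get("data-type") != child_type:
            continue
        # Walk up the tree until an enclosing parent_type element or the root.
        ancestor = parents.get(element)
        while ancestor is not None and ancestor.get("data-type") != parent_type:
            ancestor = parents.get(ancestor)
        if ancestor is None:
            found.append(element)
    return found
```

Calling `misplaced(ET.parse(path).getroot())` on a book file lists every solution that sits outside an exercise.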


philschatz · 2017-06-01T01:11:17Z

openstax/rhaptos.cnxmlutils@8dcf3d1 contains all the fixes to get algebra-intermediate-book to validate the CNXML->HTML conversion

philschatz · 2017-06-01T02:10:54Z

Astronomy now validates as well

philschatz · 2017-06-17T00:30:06Z

There are now autogenerated docs (from the .rng files) that can be viewed/reviewed at ./textbook-html/docs/

This should help in code-review and as a reference to link to

This seems to require prefixing RNG elements with `r:` so that the HTML elements can remain unprefixed

philschatz · 2017-11-07T19:26:24Z

Last night I tried rebuilding all the textbooks in https://github.com/philschatz/textbooks and ran into validation errors so it appears that this PR needs to be updated.

reedstrm

I haven't parsed the rng in detail, but the autogenerated docs seem accurate - assuming the comments match the code, we're good to go on this.

philschatz · 2017-11-08T18:11:40Z

@reedstrm Oops, I was wrong... they weren't "validation errors", this correctly caught the following:

an invalid link (see "Invalid link in Sociology", oer.exports#2788 (comment))
an invalid mime type for images (was image/jpg but should be image/jpeg)

So, yeah, the grammar matches the XSLT in openstax/rhaptos.cnxmlutils#157

reedstrm · 2017-11-08T21:19:29Z

Ah, and the most recent commit doesn't pass tests for some reason?

philschatz · 2017-11-08T21:58:25Z

I'm a little confused by the question. The most-recent commit does pass tests (the Travis-CI checkmark).

Yesterday a book was failing validation and I thought it was because I needed to update the .rng files but it turns out the content was bad so validation correctly failed.

brittweinstein · 2019-02-22T20:39:47Z

CLOSING. If this presents itself as an issue again, we will reopen.

start textbook-html RNG

b6a9c34

philschatz added 6 commits May 10, 2017 17:39

remove all "error: attribute" jing errors

6e93cd7

java -jar ./cnxml/jing.jar ./textbook-html/textbook-html.rng /path/to/cnx-rulesets/data/econ-raw.xhtml | grep "error: attribute"

fix all "unknown element" jing errors (including math)

b7fed64

java -jar ./cnxml/jing.jar ./textbook-html/textbook-html.rng /path/to/cnx-rulesets/data/econ-raw.xhtml | grep "unknown element"

🐛 fix test

d585c10

🐛 allow certain children for elements

f800bad

🐛 make divs more flexible (just to reduce errors)

1f73ca4

fix all "required attributes missing" jing errors

9a3f981

philschatz added 11 commits May 17, 2017 03:11

enumerate attribute values

b79e7c7

better pattern matching for div notes, equations, and lists

db99f52

more support for subtypes of divs

400a7e3

🎨 simplify markup

aac2618

add chapter and page structure

bb98bf7

ensure book structure is validated

dfe16e5

✅ get test to pass

99cc104

by adding boilerplate

build several tests

7a1ad51

make schema attributes on <body> optional

1eeda22

support Example, inline newlines, and data-valign attribute

ce3e612

🐛 fixes for algebra-intermediate-book

a38379f

philschatz mentioned this pull request May 31, 2017

Change the conversion because of RNG validation errors openstax/rhaptos.cnxmlutils#157

Closed

get microbiology to validate (add footnotes)

3bd95cd

🐛 fixes for algebra-trigonometry-book

d18fdef

🎨 clarify what is optional/*/+ by wrapping in parentheses

a4a7d3e

philschatz added 16 commits June 16, 2017 19:35

🐛 optional attributes now have a "?"

9560ff4

collapse rng:optional rng:zeroOrMore rng:oneOrMore to attributes

7048e6e

reorder the content definitions so they are more readable

7ecf3b2

🎨 make choices a little clearer

9e4ca8a

allow documentation inside the RNG file

c2400a6

This seems to require prefixing RNG elements with `r:` so that the HTML elements can remain unprefixed

🎨 rename book-structure.rng to clarify

4220996

reorder book structure and make more h2 elements

333e3e0

🎨 change "Organization" to "Overview"

101d8b6

🎨 explain why body and elements are defined in the RNG

db9ca93

create a Blockish base class

8d3bd4c

more documentation (block vs inline, datatypes)

71ab541

🎨 docs and comments

1930e27

create Content.Section (instead of section)

d870526

🎨 clarify some datatypes

40a9504

🎨 move Glossary and FootnoteRefs into Structure.Page

4159e30

Support Content.CiteTitle

9e73b72

philschatz mentioned this pull request Nov 7, 2017

New book tools openstax/content-engineering-old#4

Merged

reedstrm approved these changes Nov 8, 2017

View reviewed changes

philschatz added 2 commits November 8, 2017 13:34

add link to HTML documentaion

5566f93

fixup! add link to HTML documentaion

2c315d4

helenemccarron mentioned this pull request May 17, 2018

Centered text shows up as a grey bar openstax/webview#1633

Closed

philschatz mentioned this pull request Jun 28, 2018

<newline> changes related to Different font color in list of authors openstax/webview#1666

Closed

brittweinstein closed this Feb 22, 2019
