Add textbook-html Relax-NG Grammar #9

Closed; 71 commits were proposed for merge.

@philschatz
Member

philschatz commented May 10, 2017


The goal is to validate against the raw (and then baked) books in https://github.com/Connexions/cnx-rulesets

TODO

First, widen the grammar to stop erroring:

  • add XHTML5 elements that openstax uses
    • the latest grammar that I found only supports XHTML 1.0
    • examples: <figure>, <section>
  • add itemprop attributes to the elements that need it
  • add data-type and data-* attributes to the elements that need it
  • add MathML module
    • this adds the Unofficial "mathml2" module that is included in the cnxml RNG

Example grammar for the book structure, as a .rnc file (more readable):
default namespace = "http://www.w3.org/1999/xhtml"

include "_content.rnc"
# <head itemscope="itemscope" itemtype="http://schema.org/Book">
Structure.Book.Head.attlist =
  attribute itemscope { "itemscope" },
  attribute itemtype { "http://schema.org/Book" }
# from xhtml/modules/struct.rng but changed to only include a subset of valid "root" elements
Structure.Book.Body =
  element body {
    # <body itemscope="itemscope" itemtype="http://schema.org/Book">
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Book" },
    body.attlist,
    # Actually, it's BookMetadata
    Structure.Book.Metadata,
    Structure.Book.ToC,
    (Structure.Chapter
     | # Preface or Appendix
       Structure.Page)+
  }
Structure.Page =
  element div {
    attribute data-type { "page" },
    class.attrib,
    id.attrib.required,
    Structure.PageMetadata,
    Structure.Page.Title,
    Structure.Page.Abstract?,
    Flow.model*,
    Content.Glossary?,
    Content.FootnoteRefs?
  }
Structure.Chapter =
  element div {
    attribute data-type { "chapter" },
    Structure.ChapterMetadata,
    element h1 {
      attribute data-type { "document-title" },
      Inline.model
    },
    Structure.Page+
  }
# <div data-type="document-title" id="auto_7a6f73c5-5378-408b-986f-e54416e12ad0_72544">Preface</div>
Structure.Page.Title =
  element div {
    attribute data-type { "document-title" },
    # id is not required for individual CNXML->HTML pages but *is* required for book conversion
    id.attrib,
    Inline.model
  }
# <div data-type="abstract" id="auto_7a6f73c5-5378-408b-986f-e54416e12ad0_78256">
Structure.Page.Abstract =
  element div {
    attribute data-type { "abstract" },
    # id is not required for individual CNXML->HTML pages but *is* required for book conversion
    id.attrib,
    Flow.model
  }
Structure.PageMetadata =
  element div {
    attribute data-type { "metadata" },
    Structure.Metadata.Title,
    Structure.Metadata.Uri,
    Structure.Metadata.ShortId,
    Structure.Metadata.Authors,
    Structure.Metadata.Publishers,
    Structure.Metadata.Permissions,
    Structure.Metadata.Description,
    Structure.Metadata.Keywords,
    Structure.Metadata.Subjects?,
    Structure.Metadata.Resources?
  }
Structure.Metadata.Title =
  element h1 {
    attribute data-type { "document-title" },
    attribute itemprop { "name" },
    Inline.model
  }
Structure.Metadata.Uri =
  element span {
    attribute data-type { "cnx-archive-uri" },
    attribute data-value { UUID-and-version.datatype }
  }
Structure.Metadata.ShortId =
  element span {
    attribute data-type { "cnx-archive-shortid" },
    attribute data-value { ShortId.datatype }
  }
Structure.Metadata.Authors =
  element div {
    attribute class { "authors" },
    (# Allow text like "Edited by:" intermingled in the DOM
     text?,
     Structure.Metadata.Authors.Item)+,
    # Allow text like "Edited by:" intermingled in the DOM
    text?
  }
Structure.Metadata.Authors.Item =
  element span {
    attribute data-type { "author" },
    id.attrib.required,
    attribute itemprop { "author" },
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Person" },
    element a {
      attribute data-type { "cnx-id" },
      attribute href { Text.datatype },
      attribute itemprop { "url" },
      Text.datatype
    }
  }
Structure.Metadata.Publishers =
  element div {
    attribute class { "publishers" },
    (# Allow text like "Edited by:" intermingled in the DOM
     text?,
     Structure.Metadata.Publishers.Item)+,
    # Allow text like "Edited by:" intermingled in the DOM
    text?
  }
# Copied from Structure.Metadata.Authors.Item
# TODO: Combine these two definitions somehow.
Structure.Metadata.Publishers.Item =
  element span {
    attribute data-type { "publisher" },
    id.attrib.required,
    attribute itemprop { "publisher" },
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Person" },
    element a {
      attribute data-type { "cnx-id" },
      attribute href { Text.datatype },
      attribute itemprop { "url" },
      Text.datatype
    }
  }
Structure.Metadata.Permissions =
  # change class="permissions" to some other attribute
  element div {
    attribute class { "permissions" },
    Structure.Metadata.Copyrights?,
    Structure.Metadata.License
  }
Structure.Metadata.License =
  element p {
    attribute class { "license" },
    text,
    # <a data-type="license" itemprop="...">
    element a {
      attribute data-type { "license" },
      attribute href { URI.datatype },
      attribute itemprop { "dc:license,lrmi:useRightsURL" },
      text
    }
  }
Structure.Metadata.Copyrights =
  element p {
    attribute class { "copyright" },
    # Copyright:
    text,
    # <span data-type="copyright-holder" id="copyright-holder-1" itemprop="copyright-holder" itemscope="itemscope" itemtype="http://schema.org/Person">
    #   <a data-type="cnx-id" href="cnxecon" itemprop="url">OpenStax Economics</a>
    # </span>
    Structure.Metadata.Copyrights.Item+
  }
Structure.Metadata.Copyrights.Item =
  element span {
    attribute data-type { "copyright-holder" },
    attribute itemprop { "copyright-holder" },
    attribute itemscope { "itemscope" },
    attribute itemtype { "http://schema.org/Person" },
    element a {
      attribute data-type { "cnx-id" },
      # TODO: This attribute is invalid. the element should not be a link
      attribute href { UserLogin.datatype },
      attribute itemprop { "url" },
      UserName.datatype
    }
  }
# <span data-type="cnx-archive-uri" data-value="69619d2b-68f0-44b0-b074-a9b2bf90b2c6@11.332"/>
# <span data-type="cnx-archive-shortid" data-value="aWGdK2jw@11.332"/>
Structure.Book.Metadata =
  element div {
    attribute data-type { "metadata" },
    Structure.Metadata.Title,
    Structure.Metadata.Uri,
    Structure.Metadata.ShortId,
    Structure.Metadata.Authors,
    Structure.Metadata.Publishers,
    Structure.Metadata.PrintStyle,
    Structure.Metadata.TranslucentBinding?,
    Structure.Metadata.Permissions,
    Structure.Metadata.Description,
    Structure.Metadata.Subjects
  }
# <div class="print-style">
# Print style:
# <span data-type="print-style">ccap-economics</span>
# </div>
Structure.Metadata.PrintStyle =
  element div {
    attribute class { "print-style" },
    text,
    # "Print style:"
    element span {
      attribute data-type { "print-style" },
      TODO.enum.datatype
    }
  }
Structure.ChapterMetadata =
  element div {
    attribute data-type { "metadata" },
    Structure.Metadata.Title,
    Structure.Metadata.TranslucentBinding,
    Structure.Metadata.Permissions
  }
Structure.Metadata.TranslucentBinding =
  # TODO: remove me. why is this here?
  element span {
    attribute data-type { "binding" },
    attribute data-value { "translucent" }
  }
Structure.Metadata.Description =
  element div {
    attribute class { "description" },
    attribute data-type { "description" },
    attribute itemprop { "description" },
    Flow.model+
  }
Structure.Metadata.Keywords =
  # TODO: This should probably be wrapped in an element
  element div {
    attribute data-type { "keyword" },
    attribute itemprop { "keywords" },
    text
  }*
# <div data-type="subject" itemprop="about">Mathematics and Statistics</div>
Structure.Metadata.Subjects =
  element div {
    attribute data-type { "subject" },
    attribute itemprop { "about" },
    Subject.datatype
  }*
# <div data-type="resources" style="display: none"> <ul>
# <li><a href="971b6320e705d9c81cbad7f4a98148ab91456d3b">971b6320e705d9c81cbad7f4a98148ab91456d3b</a></li>        </ul>
Structure.Metadata.Resources =
  element div {
    attribute data-type { "resources" },
    # TODO: remove this attribute, maybe the whole element. What is it used for?
    attribute style { "display: none" },
    element ul { Structure.Metadata.Resources.Item+ }?
  }
Structure.Metadata.Resources.Item =
  element li {
    element a {
      attribute href { Sha.datatype },
      Sha.datatype
    }
  }
# The <nav id="toc">
Structure.Book.ToC =
  element nav {
    attribute id { "toc" },
    element ol {
      Structure.Book.ToC.LeafItem*,
      Structure.Book.ToC.InternalItem+,
      Structure.Book.ToC.LeafItem*
    }
  }
Structure.Book.ToC.LeafItem =
  element li {
    attribute cnx-archive-shortid { ShortId.datatype },
    attribute cnx-archive-uri { UUID-and-version.datatype },
    element a {
      attribute href { URI.datatype },
      text
    }
  }
Structure.Book.ToC.InternalItem =
  element li {
    element span { text },
    element ol {
      (Structure.Book.ToC.InternalItem | Structure.Book.ToC.LeafItem)+
    }
  }
# <meta content="en" data-type="language" itemprop="inLanguage"/>
# <meta content="MathML" itemprop="accessibilityFeature"/>
# <meta content="LaTeX" itemprop="accessibilityFeature"/>
# <meta content="alternativeText" itemprop="accessibilityFeature"/>
# <meta content="captions" itemprop="accessibilityFeature"/>
# <meta content="structuredNavigation" itemprop="accessibilityFeature"/>
# <meta content="2014-01-02T22:13:54Z" itemprop="dateCreated"/>
# <meta content="2016-11-04T17:17:06Z" itemprop="dateModified"/>
enum.attr.meta.itemprop =
  "inLanguage" | "accessibilityFeature" | "dateCreated" | "dateModified"
enum.attr.meta.data-type = "language"
meta.attlist &=
  # <meta content="en" data-type="language" itemprop="inLanguage"/>
  (attribute itemprop { "inLanguage" },
   attribute data-type {
     # TODO: Remove this attribute
     "language"
   })
  | # All other <meta> elements listed above
    # (accessibilityFeature, dateCreated, dateModified)
    attribute itemprop {
      "accessibilityFeature" | "dateCreated" | "dateModified"
    }

And then, fix HTML elements that are not allowed in a certain context (in the CNXML->HTML conversion)

  • p > div
  • p > table
  • p > figure
  • p > blockquote
  • ??? > ol
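
A quick way to locate the nestings listed above before touching the conversion is to scan the converted XHTML directly. The following is a minimal sketch, assuming the file is well-formed XML in the XHTML namespace; the function name and the sample markup are hypothetical, and only direct children of <p> are checked.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Block-level elements from the list above that must not be children of <p>.
DISALLOWED_IN_P = {"div", "table", "figure", "blockquote"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def find_block_children_of_p(xhtml_text):
    """Return (parent, child) tag-name pairs for every disallowed nesting."""
    root = ET.fromstring(xhtml_text)
    problems = []
    for p in root.iter("{%s}p" % XHTML_NS):
        for child in p:  # direct children only
            name = local_name(child.tag)
            if name in DISALLOWED_IN_P:
                problems.append(("p", name))
    return problems


# Hypothetical sample markup, not taken from a real book.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <p>Intro <table><tr><td>1</td></tr></table></p>
  <p>Fine <em>inline</em> text</p>
  <p><figure><img src="a.png" alt=""/></figure></p>
</body></html>"""

if __name__ == "__main__":
    for parent, child in find_block_children_of_p(SAMPLE):
        print(f"{parent} > {child}")
```

This does not replace validation with the grammar; it only gives a shorter list of offenders to fix in the conversion.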

And then, clean up the metadata blocks:

  • duplicate id="publisher-1" attributes
  • <div class="..."> elements that have no data-type (or other distinguishing attribute), so they cannot be validated
  • invalid id="{UUID}" attributes (an id must start with a letter, not a digit)
  • ...
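
The two id problems above (duplicates and ids that start with a digit) can be found with a short scan. This is a sketch under assumptions: the regular expression only approximates the XML NCName rule, the function name is hypothetical, and the sample markup is invented (it reuses the UUID and the publisher-1 id mentioned in this PR).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Approximation of an XML NCName: starts with a letter or underscore,
# followed by word characters, dots, or hyphens. Never starts with a digit.
VALID_ID = re.compile(r"[A-Za-z_][\w.\-]*")


def audit_ids(xhtml_text):
    """Return (duplicate_ids, malformed_ids) found in the document."""
    root = ET.fromstring(xhtml_text)
    ids = [el.get("id") for el in root.iter() if el.get("id") is not None]
    counts = Counter(ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    malformed = sorted(i for i in counts if not VALID_ID.fullmatch(i))
    return duplicates, malformed


# Hypothetical sample markup.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <div data-type="page" id="69619d2b-68f0-44b0-b074-a9b2bf90b2c6">
    <span data-type="publisher" id="publisher-1"/>
  </div>
  <div data-type="page" id="page-2">
    <span data-type="publisher" id="publisher-1"/>
  </div>
</body></html>"""

if __name__ == "__main__":
    duplicates, malformed = audit_ids(SAMPLE)
    print("duplicate ids:", duplicates)
    print("malformed ids:", malformed)
```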

And then, begin restricting the elements and attributes:

  • remove elements that the CNXML->HTML conversion does not support
    • examples: <abbr>, <script>
    • restrict headings
    • restrict span/div
    • restrict inline elements (<abbr>, <acronym>, <dfn>)
  • make id attributes required for many elements
  • only allow certain element/attribute combinations
    • only allow <div data-type="exercise"> (not arbitrary divs)
    • support 1 <nav> with a restricted set of children
    • support data-numbering-style="..." only on an <ol> or a <div data-type="list">
    • ...
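
Until the grammar enforces such element/attribute combinations, a rule like the data-numbering-style one can be checked with a small script. A minimal sketch, assuming well-formed XHTML; the function name, the sample markup, and the attribute values in it are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def misplaced_numbering_style(xhtml_text):
    """Return tag names of elements that carry data-numbering-style
    but are neither <ol> nor <div data-type="list">."""
    root = ET.fromstring(xhtml_text)
    offenders = []
    for el in root.iter():
        if el.get("data-numbering-style") is None:
            continue
        is_ol = el.tag == "{%s}ol" % XHTML_NS
        is_list_div = (el.tag == "{%s}div" % XHTML_NS
                       and el.get("data-type") == "list")
        if not (is_ol or is_list_div):
            offenders.append(el.tag.rsplit("}", 1)[-1])
    return offenders


# Hypothetical sample markup; the attribute values are placeholders.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <ol data-numbering-style="arabic"><li>ok</li></ol>
  <div data-type="list" data-numbering-style="arabic">ok</div>
  <ul data-numbering-style="arabic"><li>not ok</li></ul>
  <div data-type="note" data-numbering-style="arabic">not ok</div>
</body></html>"""

if __name__ == "__main__":
    print(misplaced_numbering_style(SAMPLE))
```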

And then:

  • clean up the TODO comments
  • organize the RNG files
  • convert XML comments into <a:documentation> elements

@codecov

codecov bot commented May 10, 2017

Codecov Report

Merging #9 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff          @@
##           master     #9   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           5      5           
  Lines          74     74           
=====================================
  Hits           74     74

java -jar ./cnxml/jing.jar ./textbook-html/textbook-html.rng \
  /path/to/cnx-rulesets/data/econ-raw.xhtml | grep "error: attribute"
java -jar ./cnxml/jing.jar ./textbook-html/textbook-html.rng \
  /path/to/cnx-rulesets/data/econ-raw.xhtml | grep "unknown element"
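
Instead of running one grep per error kind, the validator output can be tallied in a single pass. This is only a sketch: it assumes each error line contains the text "error: " followed by a message whose kind ends at the first double quote, which matches the two grep patterns above. The sample lines are made up for illustration and are not real jing output.

```python
from collections import Counter


def tally_errors(lines):
    """Count messages by kind: the text between 'error: ' and the first quote."""
    counts = Counter()
    for line in lines:
        _, found, message = line.partition("error: ")
        if not found:
            continue  # not an error line
        kind = message.split('"', 1)[0].strip()
        counts[kind] += 1
    return counts


# Made-up sample lines (assumed shape only, not real validator output).
SAMPLE_LINES = [
    'econ-raw.xhtml:10:5: error: attribute "itemprop" not allowed here',
    'econ-raw.xhtml:12:9: error: unknown element "figure"',
    'econ-raw.xhtml:40:3: error: attribute "data-type" not allowed here',
]

if __name__ == "__main__":
    for kind, count in tally_errors(SAMPLE_LINES).most_common():
        print(f"{count:6d}  {kind}")
```

Because tally_errors accepts any iterable of lines, passing sys.stdin would let the script sit at the end of the java pipeline in place of grep.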
@philschatz
Member Author

Yay, down to 1828 errors (in the econ book) from roughly 50k. I think the remaining errors require changes to both:

  • cnxml->HTML conversion (e.g. <table> inside a <p>)
  • cnx-epub or whichever repo adds the metadata blocks for each PageModule
    • invalid id attributes for data-type="page"
    • duplicate ids for publisher-1 (I think they do not need an id attribute)

@reedstrm where would I look to change the metadata blocks?

@reedstrm
Contributor

Re: <div data-type="metadata">: hmm, that would be the export_epub code, which uses the formatters in https://github.com/Connexions/cnx-epub/blob/master/cnxepub/formatters.py

@reedstrm
Contributor

To get an epub to validate, try running the following on tea.cnx.org:

cnx-archive-export_epub --format baked my_ident_hash_here //etc/cnx/archive/app.ini my.epub

@philschatz
Member Author

philschatz commented May 15, 2017

@reedstrm Hmm, I tried and got the following:

cnx-archive-export_epub --format raw 69619d2b-68f0-44b0-b074-a9b2bf90b2c6 //etc/cnx/archive/app.ini my.epub
cnx-archive-export_epub: command not found

I also tried running the following but nothing showed up:

find / -name "cnx-archive-export_epub"

Where would I find that script on tea.cnx?

@reedstrm
Contributor

@philschatz check the scripts dir of the archive venv: /var/cnx/venv/archive/bin, I think.

@philschatz
Member Author

Thanks! That worked. I checked the econ book (69619d2b-68f0-44b0-b074-a9b2bf90b2c6) and the only thing that errored was the <span> tags in the ToC (a <span> without any attributes is not allowed).

Of course, I'm only on the 1st part (widen the scope of validation so that books pass through). The 2nd part is "narrow the scope of validation so that books are structurally sound".

For example, right now an "exercise solution" can exist anywhere but it should only occur inside an exercise.
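
As an illustration of the kind of structural rule the second part would enforce, here is a minimal sketch that flags solutions outside an exercise. It assumes solutions are marked with data-type="solution" (an assumption; only data-type="exercise" appears in this PR), and the function name and sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def orphan_solutions(xhtml_text):
    """Return ids of solution elements that are not nested inside an exercise."""
    root = ET.fromstring(xhtml_text)
    orphans = []

    def walk(el, inside_exercise):
        # ElementTree has no parent pointers, so carry the context downward.
        data_type = el.get("data-type")
        if data_type == "solution" and not inside_exercise:
            orphans.append(el.get("id"))
        for child in el:
            walk(child, inside_exercise or data_type == "exercise")

    walk(root, False)
    return orphans


# Hypothetical sample markup; data-type="solution" is an assumed marker.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <div data-type="exercise" id="ex-1">
    <div data-type="problem" id="prob-1">Question</div>
    <div data-type="solution" id="sol-1">Answer</div>
  </div>
  <div data-type="solution" id="sol-2">Stray answer</div>
</body></html>"""

if __name__ == "__main__":
    print(orphan_solutions(SAMPLE))
```

In the grammar itself the same restriction would come from referencing the solution pattern only from within the exercise pattern, rather than from a general content model.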

@philschatz
Member Author

openstax/rhaptos.cnxmlutils@8dcf3d1 contains all the fixes to get algebra-intermediate-book to validate the CNXML->HTML conversion

@philschatz
Member Author

Astronomy now validates as well

@philschatz
Member Author

philschatz commented Jun 17, 2017

There are now autogenerated docs (from the .rng files) that can be viewed/reviewed at ./textbook-html/docs/

This should help in code-review and as a reference to link to

@philschatz
Member Author

Last night I tried rebuilding all the textbooks in https://github.com/philschatz/textbooks and ran into validation errors so it appears that this PR needs to be updated.

Contributor

@reedstrm reedstrm left a comment


I haven't parsed the rng in detail, but the autogenerated docs seem accurate - assuming the comments match the code, we're good to go on this.

@philschatz
Member Author

@reedstrm Oops, I was wrong... they weren't "validation errors"; this correctly caught the following:

So, yeah, the grammar matches the XSLT in openstax/rhaptos.cnxmlutils#157

@reedstrm
Contributor

reedstrm commented Nov 8, 2017

Ah, and the most recent commit doesn't pass tests for some reason?

@philschatz
Member Author

I'm a little confused by the question. The most-recent commit does pass tests (the Travis-CI checkmark).

Yesterday a book was failing validation and I thought it was because I needed to update the .rng files but it turns out the content was bad so validation correctly failed.

@brittweinstein

CLOSING. If this presents itself as an issue again, we will reopen.
