Complete overhaul.
GeeLaw committed Jul 28, 2019
1 parent 039efb4 commit dc5ea22
Showing 38 changed files with 4,922 additions and 1,903 deletions.
16 changes: 15 additions & 1 deletion Build-Project.ps1
@@ -68,9 +68,23 @@ Process
         );
         $local:Evaluator = [System.Text.RegularExpressions.MatchEvaluator]{
             Param ([System.Text.RegularExpressions.Match]$match)
+            $local:target = $match.Groups[2].Value;
             $local:Content = [System.IO.Path]::Combine($PSScriptRoot,
                 'tmp', $match.Groups[2].Value);
-            $Content = [System.IO.File]::ReadAllText($Content);
+            If ($target.ToLowerInvariant().EndsWith('.js'))
+            {
+                $Content = [System.IO.File]::ReadAllText($Content);
+            }
+            Else
+            {
+                $target = [System.IO.Path]::Combine($Content, '__namespace.js');
+                $Content = Get-ChildItem `
+                    -Path ([System.IO.Path]::Combine($Content, '*.js')) `
+                    -Exclude '__namespace.js' |
+                    ForEach-Object { [System.IO.File]::ReadAllText($_.FullName) };
+                $Content = $Content -join "`n;";
+                $Content += "`n;" + [System.IO.File]::ReadAllText($target);
+            }
             $Content = $match.Groups[1].Value + $Content + $match.Groups[3].Value;
             Return $Content;
         };
3 changes: 1 addition & 2 deletions README.md
@@ -2,13 +2,12 @@

BibTeX parser written in TypeScript in a rigorous but dumb way.

This project is motivated by personal needs. Goal is to support well-structured (debatable!) BibTeX parsing and rendering. Speed and organizedness are prioritized. Code should always be written by following some documentation (though different versions of documentation contradict each other) instead of translating other implementations. Non-goals include supporting peculiar/pedantic features that doesn't seem useful to me.
This project is motivated by personal needs. Goal is to support well-structured (debatable!) BibTeX parsing and rendering. Speed and organizedness are prioritized. Code should always be written by following some documentation (though different versions of documentation contradict each other) instead of translating other implementations.

It would be amazing if you find it useful.

## To-Dos

- [ ] Add tests.
- [ ] Implement name resolution, purification and transformation.
- [ ] Implement HTML rendering with limited support of LaTeX commands.
- [ ] Implement standard styles.
75 changes: 75 additions & 0 deletions docs/BST.md
@@ -0,0 +1,75 @@
# BST functions mapped to BibTeX-TS

BibTeX-TS aims to provide all the functionality available in the BST language. Although I haven't implemented a BST interpreter (nor do I plan to), the library aims to provide intuitive and mostly compliant implementations of BST functions. (Note that some pedantic features are considered harmful. There will be examples.)

## `add.period$` function

In the BST language, this function takes a string as input and outputs a string. It appends a period to the string if the string doesn't end with one of `.?!` after removing braces.

The BibTeX-TS counterpart is the `IsCompleteSentence` method on a `Literal` object. It determines whether a period should **not** be added, i.e., whether the string ends with one of `.?!`.

From what I observe, BibTeX (BST) considers a string as ending with `.?!` if it matches the regular expression `[.?!][}]*$`, i.e., `.{}` does *not* end with a period. BibTeX-TS currently implements the check by testing against this regular expression.
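
As a rough illustration of the check described above, here is a minimal standalone sketch. It works on plain strings rather than `Literal` objects, and the names `endsSentence` and `addPeriod` are made up for this example; they are not part of BibTeX-TS.

```javascript
// Standalone sketch (not the library's code): the regular expression
// quoted above, applied to a plain string.
const ENDS_SENTENCE = /[.?!][}]*$/;

// Hypothetical helper: true if no period should be added.
function endsSentence(raw) {
  return ENDS_SENTENCE.test(raw);
}

// Hypothetical helper: append a period only when the test fails.
function addPeriod(raw) {
  return endsSentence(raw) ? raw : raw + '.';
}

console.log(endsSentence('{Done.}')); // true: only closing braces follow the period
console.log(endsSentence('Done.{}')); // false: an opening brace follows the period
console.log(addPeriod('A title'));    // "A title."
```

Trailing closing braces are skipped by `[}]*`, which is why `{Done.}` counts as a complete sentence while `Done.{}` does not.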

## `change.case$` function

In BST language, this function takes a string, another string that specifies the target case, and outputs a string.

- If the target case is `U` or `u` (upper case), the input string will be converted to upper case, except for LaTeX command names inside a special character and anything inside balanced braces that are not inside a special character. The exceptions are `\aa\ae\oe\o\l\i\j\ss` commands inside a special character. They're converted to their equivalent upper-case forms.
- If the target case is `L` or `l` (lower case), the input string will be converted to lower case, except for LaTeX command names inside a special character and anything inside balanced braces that are not inside a special character. The exceptions are `\AA\AE\OE\O\L` commands inside a special character. They're converted to their equivalent lower-case forms.
- If the target case is `T` or `t` (title case), the input string will be converted to lower case, except for *the first character* and *the first non-whitespace character after a colon with trailing whitespace*, LaTeX command names inside a special character, and anything inside balanced braces that are not inside a special character. The exceptions are `\AA\AE\OE\O\L` commands inside a special character. They're converted to their equivalent lower-case forms *if the special character is neither the first character nor the first non-whitespace character after a colon with trailing whitespace*.

Note that if you change `A: {}B` to title case, the result is `A: {}b` because `b` is not the first non-whitespace character after the colon with whitespace.

There are some pedantic features in the BST `change.case$` function.

- It doesn't fully handle the conversion of `\i\j\ss` with other LaTeX commands. For example, `{\ae\ss}` will be converted into `{\AESS}`, instead of the desired `{\AE SS}`. (It does handle the case `{\"\i \j}` being converted to `{\"IJ}`. Note how the space between `\i` and `\j` disappears.)
- It doesn't preserve the eligibility as a special character. For example, `{\ss}` will be converted into `{SS}`, which is no longer a special character.
- If a special character is the first character or the first character after a colon with whitespace, all characters inside it have their cases preserved. For example, `{\AE\OE}` is `{\AE\OE}` when converted to title case, **not** `{\AE\oe}`.

The BibTeX-TS counterparts are the `ToXxxCase` and `ToCase` methods on `Literal` instances. BibTeX-TS makes an effort to ensure that the spaces between LaTeX commands and non-commands are correctly inserted or removed, and that a special character remains a special character after conversion (by adding `\relax`, e.g., `{\ss}` in upper case is `{\relax SS}`). It faithfully implements the pedantic feature about case preservation inside a special character.

## `format.name$` function

BST `format.name$` function formats a name according to a format string. BibTeX-TS approaches the task by decomposing it into 3 subtasks:

1. Parse names out of a `Literal` using `BibTeX.ParsePersonNames` method.
2. Parse name formats out of a usual string using `BibTeX.ParsePersonNameFormat` method.
3. Format a `PersonName` using `PersonNameFormat.Format` method.

To be written...

## `purify$` function

BST `purify$` function purifies a string. The BibTeX-TS counterparts are the `Purified` and `PurifiedPedantic` properties on `Literal` instances (also on the `XyzPiece` objects). They should faithfully reimplement the effect of `purify$`:

- Outside braces, or inside braces that do not form a special character, tabs, hyphens and tildes are replaced with a space character. All other non-alphanumerical, non-space characters are removed.
- Inside a special character, LaTeX commands and non-alphanumerical characters (including whitespace) are removed. The exceptions are `\AA\aa\AE\ae\OE\oe\O\o\L\l\i\j\ss`. `\AA` (resp. `\aa`) is converted to `A` (resp. `a`), and other such commands are converted to their respective names.

The BST function will remove all non-ASCII characters. BibTeX-TS provides two versions: It does **not** remove any non-ASCII character when computing `Purified`, and it provides a pedantic version, `PurifiedPedantic`, that does.
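
To make the two rules concrete, here is a hedged standalone sketch of them on plain strings. It is not the library's implementation: `purify`, `purifySpecial` and `KEPT_COMMANDS` are names invented for this example, the input is assumed to have balanced braces, and only ASCII letters and digits are kept, so it corresponds to the pedantic behaviour.

```javascript
// Standalone sketch of the purification rules listed above.
// Commands that survive inside a special character, with their replacements.
const KEPT_COMMANDS = new Map([
  ['AA', 'A'], ['aa', 'a'],
  ['AE', 'AE'], ['ae', 'ae'], ['OE', 'OE'], ['oe', 'oe'],
  ['O', 'O'], ['o', 'o'], ['L', 'L'], ['l', 'l'],
  ['i', 'i'], ['j', 'j'], ['ss', 'ss'],
]);

// Purify the text between the outer braces of a special character.
function purifySpecial(body) {
  let out = '';
  let i = 0;
  while (i < body.length) {
    if (body[i] === '\\') {
      const m = /^[A-Za-z]+/.exec(body.slice(i + 1));
      if (m) {
        out += KEPT_COMMANDS.get(m[0]) || ''; // other commands are dropped
        i += 1 + m[0].length;
      } else {
        i += 2; // control symbol such as \' or \"
      }
    } else {
      if (/[A-Za-z0-9]/.test(body[i])) out += body[i];
      i += 1;
    }
  }
  return out;
}

function purify(s) {
  let out = '';
  let depth = 0;
  let i = 0;
  while (i < s.length) {
    const ch = s[i];
    if (ch === '{' && depth === 0 && s[i + 1] === '\\') {
      // Special character: find its matching closing brace.
      let inner = 1;
      let j = i + 1;
      while (j < s.length && inner > 0) {
        if (s[j] === '{') inner += 1;
        else if (s[j] === '}') inner -= 1;
        j += 1;
      }
      out += purifySpecial(s.slice(i + 1, j - 1));
      i = j;
      continue;
    }
    if (ch === '{') depth += 1;
    else if (ch === '}') depth = Math.max(0, depth - 1);
    else if (ch === '\t' || ch === '-' || ch === '~') out += ' ';
    else if (/[A-Za-z0-9 ]/.test(ch)) out += ch;
    i += 1;
  }
  return out;
}

console.log(purify('{\\AA}ngstr{\\"o}m')); // "Angstrom"
console.log(purify('Hello, {W}orld!'));    // "Hello World"
```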

## `text.length$` function

BST `text.length$` function computes the length of a string (with balanced braces). BibTeX-TS counterpart is the `Length` property on `Literal` instances (also the `XyzPiece` objects). It should faithfully reimplement the effect of `text.length$`:

- Each character outside braces is counted as one character.
- Each special character is counted as one character (no matter how long it is or how many letters are inside it).
- Each non-brace character inside balanced braces that do not form a special character counts as one character.

Specifically, this means the length of `c{\~ab}{{\LaTeX}}` is 8: `c` counts as 1, the special character `{\~ab}` counts as 1, and the 6 characters of `\LaTeX` count as 1 each.
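
The counting rules can be sketched as follows. This is a standalone illustration on plain strings, assuming balanced braces; `textLength` is a name made up for this example and is not the library's `Length` property.

```javascript
// Standalone sketch of the counting rules above (not the library's code).
function textLength(s) {
  let length = 0;
  let depth = 0;
  let i = 0;
  while (i < s.length) {
    const ch = s[i];
    if (ch === '{') {
      if (depth === 0 && s[i + 1] === '\\') {
        // Special character: skip to its matching brace, count it once.
        let inner = 1;
        i += 1;
        while (i < s.length && inner > 0) {
          if (s[i] === '{') inner += 1;
          else if (s[i] === '}') inner -= 1;
          i += 1;
        }
        length += 1;
        continue;
      }
      depth += 1; // ordinary brace: not counted
    } else if (ch === '}') {
      depth = Math.max(0, depth - 1); // ordinary brace: not counted
    } else {
      length += 1; // any non-brace character counts as one
    }
    i += 1;
  }
  return length;
}

console.log(textLength('c{\\~ab}{{\\LaTeX}}')); // 8
```

Note that a `{\` only starts a special character at brace depth 0, which is why `{{\LaTeX}}` is counted character by character.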

## `text.prefix$` function

BST `text.prefix$` function computes the prefix of a string (with balanced braces) of specified length. The BibTeX-TS counterparts are the `Prefix` and `PrefixRaw` methods on `Literal` instances. `Prefix` returns the prefix as a `Literal`, whereas `PrefixRaw` returns the prefix as a plain `string`. The latter avoids the overhead of arranging the content into a `Literal` and can be used if the prefix is directly handled as a plain string.

They should faithfully reimplement the effect of `text.prefix$`:

- Count length as defined by `text.length$`.
- The output will always have balanced braces. Unpaired opening braces are paired by appending as many closing braces to the end of string as needed.

Specifically, this means the prefix of length 4 of `c{\~ab}{{\LaTeX}}` is `c{\~ab}{{\L}}`.
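
A hedged standalone sketch of these two rules, again on plain strings with balanced braces assumed. `textPrefix` is a hypothetical name used only here; it is not the library's `Prefix` or `PrefixRaw`.

```javascript
// Standalone sketch of the prefix rules above (not the library's code).
function textPrefix(s, count) {
  let out = '';
  let depth = 0;
  let taken = 0;
  let i = 0;
  while (i < s.length && taken < count) {
    const ch = s[i];
    if (ch === '{') {
      if (depth === 0 && s[i + 1] === '\\') {
        // A special character is copied whole and counts as one.
        const start = i;
        let inner = 1;
        i += 1;
        while (i < s.length && inner > 0) {
          if (s[i] === '{') inner += 1;
          else if (s[i] === '}') inner -= 1;
          i += 1;
        }
        out += s.slice(start, i);
        taken += 1;
        continue;
      }
      depth += 1;
      out += ch; // braces are copied but not counted
    } else if (ch === '}') {
      depth = Math.max(0, depth - 1);
      out += ch;
    } else {
      out += ch;
      taken += 1;
    }
    i += 1;
  }
  // Pair any opening braces that are still unclosed.
  return out + '}'.repeat(depth);
}

console.log(textPrefix('c{\\~ab}{{\\LaTeX}}', 4)); // c{\~ab}{{\L}}
```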

## Why not BST interpreters in JavaScript?

The constructs that the BST language natively supports are very limited and it is cumbersome to program in it. Moreover, some operations don't have idiomatic efficient implementations, e.g., integer multiplication is implemented as repeated addition.

The BST language is used to produce output that the TeX typesetter would like, which is very limited. The original motive for creating this library was to be able to use BibTeX in my blog building system, which requires content to be rendered as HTML. BibTeX-TS parses BibTeX databases into objects with methods, and the consumers are supposed to operate on the objects and let them retain that form, except when preparing for final output. Therefore, it suffices to implement the important BST functions and let the usual operations be handled by JavaScript.
71 changes: 71 additions & 0 deletions docs/README.md
@@ -0,0 +1,71 @@
# BibTeX-TS

To import the library, put the generated `/lib/bibtex.js` file somewhere and `const BibTeX = require('path/to/bibtex.js')`. The basic idea is that any publicly available classes and functions are consumable from JavaScript and should resist malformed input. Private constructions are available in `BibTeX._Privates` (as well as `obj._Privates` and `obj._MutablePrivates` for objects with private fields).

Users familiar with the BST language can find a quick chart in [`BST.md`](BST.md).

## Methods

There are 4 methods available under the `BibTeX` object.

- `ParseLiteral` parses a string into a `Strings.Literal` object by returning a `Strings.ParseLiteralResult` object. Inspect the returned object to see whether an error occurred and get the result.
- `ParseDatabase` parses the content of a `.bib` file and returns an `ObjectModel.ParseDatabaseResult` object. Inspect it to get the result.
- `ParsePersonNames` parses a `Strings.Literal` object into an array of `ObjectModel.PersonName`.
- `ParsePersonNameFormat` parses a `string` into an `ObjectModel.PersonNameFormat` object.

## `Strings` namespace

This namespace contains the string model used in BibTeX-TS. It has 3 classes representing fragments of strings, called *pieces*:

- `BasicPiece` represents a fragment of string that is outside braces and contains no braces.
- `SpCharPiece` represents a fragment of string that is considered a *special character*. It is a fragment that begins with `{\` and has balanced braces.
- `BracedPiece` represents a fragment of string enclosed in braces but that is not a special character. It is a fragment that begins with `{` not followed by `\` and has balanced braces.

For example, the string `{\relax Ch}ristopher learnt how to use {{\LaTeX}}.` consists of 4 pieces:

1. `{\relax Ch}` is a `SpCharPiece`. Its *value* is the fragment with the first and last braces removed, i.e., `\relax Ch`.
2. `ristopher learnt how to use ` is a `BasicPiece`. Its value is the fragment itself.
3. `{{\LaTeX}}` is a `BracedPiece`. Its value is the fragment with the first and last braces removed, i.e., `{\LaTeX}`.
4. `.` is another `BasicPiece`.
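
The decomposition above can be reproduced with a short standalone sketch. It does not use the library: `splitPieces` is a hypothetical helper that returns plain `{ kind, value }` objects instead of real piece instances, and it assumes the input has balanced braces.

```javascript
// Standalone sketch (not the library's parser): split a raw string into
// top-level pieces and report each piece's kind and value.
function splitPieces(raw) {
  const pieces = [];
  let i = 0;
  while (i < raw.length) {
    if (raw[i] === '{') {
      // Braced fragment: scan to the matching closing brace.
      let depth = 1;
      let j = i + 1;
      while (j < raw.length && depth > 0) {
        if (raw[j] === '{') depth += 1;
        else if (raw[j] === '}') depth -= 1;
        j += 1;
      }
      const value = raw.slice(i + 1, j - 1); // outer braces removed
      const kind = value.startsWith('\\') ? 'SpCharPiece' : 'BracedPiece';
      pieces.push({ kind, value });
      i = j;
    } else {
      // Text outside braces runs until the next opening brace.
      let j = i;
      while (j < raw.length && raw[j] !== '{') j += 1;
      pieces.push({ kind: 'BasicPiece', value: raw.slice(i, j) });
      i = j;
    }
  }
  return pieces;
}

const example = '{\\relax Ch}ristopher learnt how to use {{\\LaTeX}}.';
for (const piece of splitPieces(example)) {
  console.log(piece.kind, JSON.stringify(piece.value));
}
```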

Pieces can be constructed using their corresponding constructors, which accept the desired *value* (not the fragment itself!). Each class also has an `Empty` static property that stores a canonical empty instance. Note, however, that an empty `SpCharPiece` is actually `{\relax}`, where `\relax` is the LaTeX command that does nothing.

Strings are represented using the `Literal` class. Such an instance can be constructed from the array of its pieces. The following properties are the most important:

- `Raw` stores a string that can be parsed into the equivalent `Literal` instance using `BibTeX.ParseLiteral`.
- `Pieces` provides access to individual pieces. It is guaranteed that the pieces will not have empty `BasicPiece`s nor consecutive `BasicPiece`s.

`Literal` also has a static `Empty` storing a canonical empty instance.

## `ObjectModel` namespace

This namespace contains several classes. Generally, instances of those classes should be parsed and not created by consumers.

`StringRef` and `StringExpr` represent string operations in a `.bib` database file. A `StringExpr` consists of several summands, each of which is either a `Literal` or a `StringRef`. For example, `author = {A} # and # "B"` will be parsed into an `EntryData`, whose `Fields.author` is a `StringExpr` consisting of 3 summands, the first and the third being `Literal`s and the second being a `StringRef` to `and` (which is supposed to be defined by a `@string` command).

`Entry` and `EntryData` represent resolved and unresolved entries, respectively. By parsing a `.bib` file, you get several `EntryData` instances. By calling the `Resolve` method, you get the `Entry` of that `EntryData`. The difference is whether `Fields` contain `Literal`s (`Entry`) or `StringExpr`s (`EntryData`).
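
To illustrate what resolving a string expression amounts to, here is a standalone sketch with plain objects. The summand shapes, the `resolveSummands` helper and the `macros` table are all hypothetical stand-ins for this example, not the library's `StringExpr`, `StringRef` or `Resolve`. It assumes `@string{and = " and "}` was defined, looks names up case-insensitively, and resolves unknown names to the empty string.

```javascript
// Standalone sketch: a summand is either { kind: 'literal', text }
// or { kind: 'ref', name }. These shapes are invented for illustration.
function resolveSummands(summands, macros) {
  return summands
    .map((summand) => {
      if (summand.kind === 'literal') return summand.text;
      // Assumption: names are case-insensitive, unknown names are empty.
      const key = summand.name.toLowerCase();
      return macros.has(key) ? macros.get(key) : '';
    })
    .join('');
}

// Stand-in for @string{and = " and "}.
const macros = new Map([['and', ' and ']]);

// Stand-in for author = {A} # and # "B".
const author = [
  { kind: 'literal', text: 'A' },
  { kind: 'ref', name: 'and' },
  { kind: 'literal', text: 'B' },
];

console.log(resolveSummands(author, macros)); // "A and B"
```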

`BibTeX.ParseDatabase` returns its results as a `ParseDatabaseResult` object, from where you can inspect errors and results. The errors are represented as `ParseDatabaseError` objects. This means all errors will be reported (as opposed to all other parsing methods where only the first error is reported).

`PersonName` represents a name. This is a BST language concept. Names are parsed from a `Literal`, where consecutive names are separated by the word `and` surrounded by whitespace. Each name can take the form `First von Last` or `von Last, First` or `von Last, Jr, First`. The `PersonName` object stores the words of each part of the name as well as the separators between consecutive words. Use `BibTeX.ParsePersonNames` to obtain instances of this class. For each name, the first error in parsing that name is reported.
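
The splitting step and the choice among the three forms can be sketched on plain strings as below. This is not the library's parser: `splitNames` and `nameForm` are hypothetical helpers, and ignoring `and` and commas that sit inside braces is an assumption of this sketch.

```javascript
// Standalone sketch: split a name list on the word "and" surrounded by
// whitespace, looking only outside braces (an assumption of this sketch).
function splitNames(s) {
  const names = [];
  let depth = 0;
  let start = 0;
  let i = 0;
  while (i < s.length) {
    const ch = s[i];
    if (ch === '{') depth += 1;
    else if (ch === '}') depth = Math.max(0, depth - 1);
    else if (depth === 0) {
      const m = /^\s+and\s+/.exec(s.slice(i));
      if (m) {
        names.push(s.slice(start, i));
        i += m[0].length;
        start = i;
        continue;
      }
    }
    i += 1;
  }
  names.push(s.slice(start));
  return names;
}

// Pick the form of a single name by counting commas outside braces.
function nameForm(name) {
  let depth = 0;
  let commas = 0;
  for (const ch of name) {
    if (ch === '{') depth += 1;
    else if (ch === '}') depth = Math.max(0, depth - 1);
    else if (ch === ',' && depth === 0) commas += 1;
  }
  const forms = ['First von Last', 'von Last, First', 'von Last, Jr, First'];
  return commas < forms.length ? forms[commas] : null; // null: too many commas
}

console.log(splitNames('{Barnes and Noble} and Jean-Baptiste de La Salle'));
console.log(nameForm('de La Salle, Jean-Baptiste')); // "von Last, First"
```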

`PersonNameFormatComponent` represents a name component in a name format. It's included in the publicly visible part mainly for `instanceof` testing.

`PersonNameFormat` is a reusable name format object. Use `BibTeX.ParsePersonNameFormat` to obtain an instance. Use `PersonNameFormat.Format` method to format a `PersonName`. See the following example:

```JavaScript
const BibTeX = require('./bibtex.js');
const fmt = '{f. }{vv }{ll}';
const format = BibTeX.ParsePersonNameFormat(fmt);
const name = BibTeX.ParsePersonNames(BibTeX.ParseLiteral('Jean-Baptiste de La Salle').Result)[0];
// J.-B. de La~Salle
console.log(format.Format(name));
// (same)
console.log(name.Format(fmt));
```

The advantage of using a `PersonNameFormat` object is efficiency. Each time `PersonName.Format` is called, the string needs to be parsed. If the same format is used for many names, parse the format string into a `PersonNameFormat` and use `PersonNameFormat.Format`.

## `TeX` namespace

This namespace has one abstract class `SimpleHandler`. It is a sufficient template to handle most TeX rendering you need in a `.bib` file. A handler can be created by deriving the class and implementing `EatXxx` methods and `Finish` method.
