Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bracing for words containing unbraced capital letters #118

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mcsitter
Copy link

@mcsitter mcsitter commented Jan 5, 2022

Thank you very much for this awesome package. I implemented the option described in #24 with the following behavior:

  • Escape all upper-case letters at the beginning of words
  • Escape all words with >1 upper-case letter, if it starts with a capital letter
  • Don't do anything to words that are preceded by a curly brace
    there are however some things to consider regarding the implementation, which I would like to discuss before merging.

Behaviour

test --> test
Test --> {T}est
TESt --> {TEst}
tESt --> tESt

The example shows, that a word that does not have a capital letter at the start, but more than one inside will not be capitalized. If we want to do that, we could just say: escape any group (1-X) of uppercase letters.
A common pattern is also to escape like this:

Test --> {Test}

This will not be changed, however, it is quite common. Should the given pattern be changed to the default of bibtex-tidy? Maybe the option to choose:

  1. Escape the whole word if it contains any uppercase letter
  2. Escape the uppercase letter if it is at the beginning of the word, the whole word if there are any other uppercase letters in it (current implementation)
  3. Escape any uppercase letter or group of consecutive uppercase letters (i have the code ready. A simple change in the regex to set the starting letter from [A-Z] to \w

Allow configuring fields?

As of now, if the option is turned on (default: false), the escaping of the upper case is applied to all fields or rather all fields containing any capital letters. The escaping of capital letters will thus also be applied to Journal, Series, Author, or any other field, which may be undesirable. I think there are 3 ways to deal with this:

  1. Keep it the way it is
  2. Allow configuration of the fields it should be applied to like in the sortFields option. If we can't find sensible defaults, this is probably best, as we could just have the defaults there, but editable (think sort fields option).
  3. Configure the option to only apply to certain fields. I'm thinking of title, shorttitle and booktitle. Happy to count any others in! If we can find sensible defaults, this may be preferred due to the required UI space of option 2.

Option name

Right now the chosen name for the cli option is --escape-uppercase. This can lead to unclear cli arguments like bibtex-tidy --escape-uppercase --no-escape . The escaping of special characters and of uppercase letters has possibly confusable names throughout the codebase.

Tests

I will of course add a testcase, once the desired behaviour is defined!


BTW: Thanks for the understandable structure of the codebase. I don't know the structure of node packages that well, but could easily find the places to add the option, namely docs/index.html, docs/index.ts, src/format.ts, src/optionDefinition.ts, and src/utils.ts.

Thank you very much for the feedback!
Let me know, whether you fear a cluttered UI due to a way too high amount of options. We may not want to be too opinionated, but that means implementing more options. I might tackle some other issues as well but would like your stance on this. Maybe a CONTRIBUTING.md may be a good idea, to also document how to add more options and capture the philosophy.

Closes #24

@FlamingTempura
Copy link
Owner

Thanks for your contributions, and thanks for documenting them so thoroughly!

Thank you very much for this awesome package. I implemented the option described in #24 with the following behavior:

  • Escape all upper-case letters at the beginning of words
  • Escape all words with >1 upper-case letter, if it starts with a capital letter
  • Don't do anything to words that are preceded by a curly brace
    there are however some things to consider regarding the implementation, which I would like to discuss before merging.

Behaviour

test --> test
Test --> {T}est
TESt --> {TEst}
tESt --> tESt

The example shows, that a word that does not have a capital letter at the start, but more than one inside will not be capitalized. If we want to do that, we could just say: escape any group (1-X) of uppercase letters. A common pattern is also to escape like this:

Test --> {Test}

This will not be changed, however, it is quite common. Should the given pattern be changed to the default of bibtex-tidy? Maybe the option to choose:

  1. Escape the whole word if it contains any uppercase letter
  2. Escape the uppercase letter if it is at the beginning of the word, the whole word if there are any other uppercase letters in it (current implementation)
  3. Escape any uppercase letter or group of consecutive uppercase letters (i have the code ready. A simple change in the regex to set the starting letter from [A-Z] to \w

I'd be in favour of option 1 because (a) it's consistent behaviour for both cases ({Apple} {arXiv}) and (b) I find the output slightly more readable.

Out of interest - what might be the potential pitfalls of double bracing the entire value?

journal = {{Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}}

Another consideration (perhaps for a separate PR) is to only wrap words which have uppercase letters not at the start. This way we still allow the bibtex style to make casing consistent except where the uppercase is clearly intentional.

title = {Something to do with {arXiv}},
journal = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition}

Finally, we have an issue with some bibtex exports containing fully uppercases values. I don't have any examples to hand but I have seen examples like this come from Scholar exports:

journal = {PROCEEDINGS OF THE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION}

It may be prudent to skip wrapping capitals if the entire value is uppercase. Although we do have some that might be valid 🤷:

journal = {CHI 2018}

Allow configuring fields?

As of now, if the option is turned on (default: false), the escaping of the upper case is applied to all fields or rather all fields containing any capital letters. The escaping of capital letters will thus also be applied to Journal, Series, Author, or any other field, which may be undesirable. I think there are 3 ways to deal with this:

  1. Keep it the way it is
  2. Allow configuration of the fields it should be applied to like in the sortFields option. If we can't find sensible defaults, this is probably best, as we could just have the defaults there, but editable (think sort fields option).
  3. Configure the option to only apply to certain fields. I'm thinking of title, shorttitle and booktitle. Happy to count any others in! If we can find sensible defaults, this may be preferred due to the required UI space of option 2.

Option 1 seems overkill - e.g. would it really be needed for abstract? It might mess up author/editor/doi/url/crossref values too, because they have to be formatted in a certain way.

Option 2 seems reasonable, but with different defaults. e.g. title, shorttitle, booktitle, journal, publisher.

Option name

Right now the chosen name for the cli option is --escape-uppercase. This can lead to unclear cli arguments like bibtex-tidy --escape-uppercase --no-escape . The escaping of special characters and of uppercase letters has possibly confusable names throughout the codebase.

I agree "escape" is potentially confusing. How about --brace-capitals or --protect-capitals?

It's worth noting that your current solution appears to be limited to the latin alphabet. I'm OK with this as a limitation because it would be harder to write a regexp for other alphabets, but it would be good to see it noted in the code.

Tests

I will of course add a testcase, once the desired behaviour is defined!

BTW: Thanks for the understandable structure of the codebase. I don't know the structure of node packages that well, but could easily find the places to add the option, namely docs/index.html, docs/index.ts, src/format.ts, src/optionDefinition.ts, and src/utils.ts.

Thank you very much for the feedback! Let me know, whether you fear a cluttered UI due to a way too high amount of options. We may not want to be too opinionated, but that means implementing more options. I might tackle some other issues as well but would like your stance on this. Maybe a CONTRIBUTING.md may be a good idea, to also document how to add more options and capture the philosophy.

Great idea capturing the philosophy in CONTRIBUTING.md. It was originally very opinionated as it was literally just a script for cleaning up my theses bibliography, but nowadays I tend to cater to requests for greater customisability.

I'm not too worried about the UI getting cluttered. We can refine it to hide details. Maintaining backward compatibility is the key thing. So if we get this new option into production, we can't really change how it behaves (unless it's a bug fix).

@mcsitter
Copy link
Author

mcsitter commented Jan 7, 2022

Thanks for your contributions, and thanks for documenting them so thoroughly!

Appreciated!

Out of interest - what might be the potential pitfalls of double bracing the entire value?

This describes quite well what to do or not to do.
tl;dr:

  1. it is in general preferable to enclose entire words
  2. it is definitely not a good idea to enclose an entire title in braces

I think we should actually follow that style, as you suggested. The sensible default would be, to not change the bracing of capitals, that are already braced. I don't know however how this should be combined with #120 . Should the combination mean, that there are no braces in the end, or that the bracing is changed to enclose entire words. One of those would happen if both are applied, except we ignore one option if the other is turned on. We should also make sure that the output is stable across runs.

I would actually want to be able to unify the bracing style. Maybe a seperate toggle could be something like a "force" flag regarding this option.

only wrap words which have uppercase letters not at the start

I can open a PR if you want.

fully uppercases values

Isn't this sufficiently handled by the option to remove all caps? I don't know, how we would make the judgement for correct or incorrect bracing.

Option 2 seems reasonable, but with different defaults. e.g. title, shorttitle, booktitle, journal, publisher.

Let me know if there are any others i should add.

How about --brace-capitals or --protect-capitals?

Brace is probably the wording to go for. I will try to come up with something precise, that allows for the implementation of other options regarding bracing without causing unnecessary confusion.

your current solution appears to be limited to the latin alphabet

Will be implemented using unicode categories.

@mcsitter mcsitter changed the title Add escape uppercase letters option Add bracing for words containing unbraced capital letters Jan 7, 2022
@@ -1,5 +1,14 @@
import { specialCharacters } from './unicode';

export function braceWordsWithCapital(str: string): string {
return str.replace(
/(?<![\{\p{L}])\p{L}*\p{Lu}\p{L}*(?![\p{L}\}])/gu,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an awesome regexp! Could you add a comment to explain what it does?

@FlamingTempura
Copy link
Owner

Agree with your design choices and code looks good. Unicode categories are cool!

Could you document the design choices somewhere in the code (maybe above the braceWordsWithCapital function):

  • We brace each word, not the whole value (https://texfaq.org/FAQ-capbibtex) or individual characters (which would reduce readability)
  • In some cases it may be undesirable to brace words in Title Case Values. For now we are not tackling this, but in future we could add an additional option to not brace words where only the first character is uppercase.
  • Sometimes bibtex exports may include values that are fully capitalised. The dropAllCaps option can be used, which bibtex-tidy will apply before braceCapitalWord.

For the option name I'd argue braceCapitals is better because it's more succinct for CLI.

Are you happy to write some tests for this?

@arkivm
Copy link

arkivm commented Jan 19, 2023

Nice feature. Thank you.
However, it capitalizes the urls as well. I think we should escape the content within the \url{} block.

@misc{sel4-perf,
-  title         = {seL4 Performance},
-  howpublished  = {\url{https://sel4.systems/About/Performance/}}
+  title         = {{seL}4 {Performance}},
+  howpublished  = {\url{https://sel4.systems/{About}/{Performance}/}}
}

@fahim831
Copy link

This looks like a great thread with good ideas of implementation of this feature. I've been running into this issue lately and was wondering if this feature was implemented yet or will be soon?

@mcsitter
Copy link
Author

mcsitter commented Nov 3, 2023

Hey, I know I have been working on it, but sadly could not get around to finishing this. This also will probably take a while. So if anyone would like to finish this up, this would be a relief to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to escape uppercase letters
4 participants