Numerals in abbreviations #155

Closed
ilarischeinin opened this issue Nov 16, 2018 · 5 comments

@ilarischeinin

I have a case that I cannot figure out how to get right. I don't think it's an exotic one, so I'd imagine it must be already possible, but just can't get my head around how to specify it with the available parameters.

I want to convert (from snake case) "t2d_status" to lower camel case "t2dStatus". The problem is that no matter what I try, I get "t2DStatus", i.e. with a capital "D" whereas I need a lowercase "d".

library(snakecase)
to_lower_camel_case("t2d_status")
#> [1] "t2DStatus"

I tried to specify "t2d" as an abbreviation so it wouldn't get broken down:

to_lower_camel_case("t2d_status", abbreviations = "t2d")
#> [1] "t2DStatus"

Also tried to specify to keep numerals as is:

to_lower_camel_case("t2d_status", numerals = "asis")
#> [1] "t2DStatus"

And to change sep_in to just "_":

to_lower_camel_case("t2d_status", sep_in = "_")
#> [1] "t2DStatus"

Created on 2018-11-16 by the reprex package (v0.2.1)

None of these seem to help (nor do my attempts with parsing_option or transliteration), so could you please point me in the right direction here? Thank you.

I did try to go through the issue tracker to see if a case like this had come up before, but that was difficult, as many issues are not very descriptive and read more like record keeping for things to be implemented.

@Tazinho
Owner

Tazinho commented Nov 16, 2018

Thanks for reporting this. In theory you are right with the first approach. The implementation of abbreviations is just too naive atm.

Currently matches of abbreviations will be surrounded internally by underscores to ensure they are recognized as substrings. However, the substrings (abbreviations) are then parsed further and in your case t2d will be parsed into 3 substrings (because of the number).

I think a perfect solution would be to ignore the abbreviations during the parsing step. However, I am not sure how to implement this in an elegant way and will have to think a bit about that.
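The behaviour described above can be reproduced with a small base R sketch. This is not snakecase's actual code: `split_naive` and `to_camel` are hypothetical helpers, and splitting at every letter/digit boundary is only an assumption about what the numeral parsing roughly does.

```r
# Hypothetical helper (not a snakecase internal): insert "_" at every
# letter/digit boundary, then split on underscores.
split_naive <- function(x) {
  x <- gsub("([0-9])([A-Za-z])", "\\1_\\2", x)
  x <- gsub("([A-Za-z])([0-9])", "\\1_\\2", x)
  parts <- strsplit(x, "_+")[[1]]
  parts[nzchar(parts)]
}

# Hypothetical helper: lower-case everything, then capitalize the first
# character of every piece except the first one.
to_camel <- function(parts) {
  parts <- tolower(parts)
  rest <- parts[-1]
  substr(rest, 1, 1) <- toupper(substr(rest, 1, 1))
  paste(c(parts[1], rest), collapse = "")
}

pieces <- split_naive("t2d_status")  # "t", "2", "d", "status"
to_camel(pieces)
#> [1] "t2DStatus"
```

Because "d" ends up as its own piece, it is capitalized like the start of any other word, which is where the unwanted "D" comes from.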

@Tazinho
Owner

Tazinho commented Nov 17, 2018

Possible implementation idea:
string -> abbreviations ->

  • Match only specific sequences for abbreviations: 1. Sequences of upper case letters (and numerals) and then sequences of lower case letters (and numerals). (Possibly also test that abbreviations are not in mixed case)
  • Matches will be replaced with
    • the result of paste(ABR, sample(LETTERS, 3), digit). Numeral digits will be replaced by abcdefghi. The replacement will be surrounded by underscores.
    • Before that, it will be checked that paste(ABR, sample(LETTERS, 3)) is not contained within the string. If it is, the sample length will be increased by one until it fits.

sep_in -> parsing_option -> split ->

  • Rereplacement of the placeholders by the original abbreviations

-> ...
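A minimal sketch of the placeholder idea, under simplifying assumptions: `make_placeholder` and `protect_and_split` are hypothetical names, the placeholder is just random upper case letters (without the ABR prefix or the digit-to-letter mapping), the abbreviation is matched as a literal string, and a plain letter/digit split stands in for the real parsing steps.

```r
# Hypothetical helper: draw a placeholder that does not occur in x yet,
# growing it by one letter per attempt until it is unique.
make_placeholder <- function(x, n = 3) {
  repeat {
    ph <- paste(sample(LETTERS, n, replace = TRUE), collapse = "")
    if (!grepl(ph, x, fixed = TRUE)) return(ph)
    n <- n + 1
  }
}

protect_and_split <- function(x, abbr) {
  ph <- make_placeholder(x)
  # Swap the abbreviation for the digit-free placeholder, fenced by underscores.
  x <- gsub(abbr, paste0("_", ph, "_"), x, fixed = TRUE)
  # Stand-in for sep_in / parsing_option: split at letter/digit boundaries.
  x <- gsub("([0-9])([A-Za-z])", "\\1_\\2", x)
  x <- gsub("([A-Za-z])([0-9])", "\\1_\\2", x)
  parts <- strsplit(x, "_+")[[1]]
  parts <- parts[nzchar(parts)]
  # Re-replace the placeholder by the original abbreviation.
  parts[parts == ph] <- abbr
  parts
}

protect_and_split("t2d_status", "t2d")
#> [1] "t2d"    "status"
```

The placeholder contains no digits, so the numeral parsing has nothing to split inside it; the result does not depend on which random letters were drawn.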

Edit: otherwise it might be possible to work around the numerals parsing

The third and possibly best approach would be to split first on the abbreviations, mark the abbreviations and then split a second time when parsing the non-abbreviation substrings. However, I will need to evaluate this approach in a new dev branch first.

@Tazinho
Owner

Tazinho commented Jan 17, 2019

Once I get to this, the process will probably have to look like this:

  • check if abbreviations were supplied
  • if true, check for matches
  • if matches occur, split the affected strings and keep track of which of the substrings are abbreviations
  • parse the substrings (which aren’t abbreviations) further...
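The steps above could be sketched in base R roughly as follows. This is only an illustration under assumptions, not the package's implementation: `split_with_abbreviations` is a hypothetical name, abbreviations are assumed to contain no regex metacharacters, and a plain underscore and letter/digit split stands in for the regular parsing.

```r
# Hypothetical sketch; none of these names are snakecase internals.
split_with_abbreviations <- function(x, abbreviations = character(0)) {
  parse_piece <- function(p) {
    # Stand-in for the regular parsing: split at "_" and letter/digit boundaries.
    p <- gsub("([0-9])([A-Za-z])", "\\1_\\2", p)
    p <- gsub("([A-Za-z])([0-9])", "\\1_\\2", p)
    out <- strsplit(p, "_+")[[1]]
    out[nzchar(out)]
  }
  # 1. Check whether abbreviations were supplied.
  if (length(abbreviations) == 0) return(parse_piece(x))
  # 2. Check for matches.
  m <- gregexpr(paste(abbreviations, collapse = "|"), x)
  # 3. Split around the matches and remember which pieces are abbreviations.
  abbrs  <- regmatches(x, m)[[1]]
  others <- regmatches(x, m, invert = TRUE)[[1]]
  # 4. Parse only the non-abbreviation pieces, keeping the original order.
  pieces <- list()
  for (i in seq_along(others)) {
    pieces <- c(pieces, list(parse_piece(others[i])))
    if (i <= length(abbrs)) pieces <- c(pieces, list(abbrs[i]))
  }
  unlist(pieces)
}

parts <- split_with_abbreviations("t2d_status", abbreviations = "t2d")
rest <- parts[-1]
substr(rest, 1, 1) <- toupper(substr(rest, 1, 1))
paste(c(parts[1], rest), collapse = "")
#> [1] "t2dStatus"
```

Without the `abbreviations` argument the same function falls back to the plain split and yields "t", "2", "d", "status" again.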

@Tazinho Tazinho added this to the PARSING milestone Feb 26, 2019
@Tazinho
Owner

Tazinho commented May 23, 2019

The above still sounds like significant overhead. Maybe the following could work:

  • replace spaces by "_" at the very beginning; this way spaces can be used to mark abbreviations
  • in order to mark the sides of each abbreviation, use the pattern "_ labbreviationr _"
  • now for each parsing helper include a negative lookbehind (<not space followed by l, but it is ok if an r followed by space>) and a similar negative lookahead so that only substrings outside of the abbreviation "scopes" are parsed
  • look into the current implementation of the numerals argument (here some markers including spaces were also used) and see if problems can be resolved...

@Tazinho
Owner

Tazinho commented May 24, 2019

Implemented in devversion-01 branch for now (almost as mentioned in the last post; not yet tested; also need to remove some overhead introduced by the current verbose implementation):

  • replace spaces by "_" at the very beginning; this way spaces can be used to mark abbreviations
  • in order to mark the sides of the abbreviations, use the pattern "_ l labbreviationr r _" (ensure that the pattern only occurs once for each abbreviation and correct wrong cases via gsub)
  • now split each string by the pattern "\sl|r\s". Then apply the parsing steps as one function inside an lapply. For each string use an ifelse to only parse those strings that don't start with "\sl".

Open steps:

  • look more closely into the markers for digits to ensure that these implementations don't collide.
  • write more tests
  • improve the speed (possibly one implementation without abbreviations, one with abbreviations that don't contain special characters/digits, one for any abbreviation)
    • better just introduce logical subsetting and lapply once over only those cases that contain abbreviations and once (without lapply) over the other entries of string.
  • enable parsing of more abbreviations (currently only abbreviations in the form of "blaABBR" or "ABBRbla" ["ABBR"/"abbr" can contain/start/end with any combination of characters, but there must not be a switch from upper to lower case or vice versa] are parsed (and protected from other parsing options) correctly)
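The logical-subsetting idea from the speed item above might look roughly like this. It is a sketch under assumptions: `parse_fast`, `parse_protected` and `parse_all` are hypothetical names, only a single literal abbreviation is handled, only its first occurrence per string is protected, and inserting "_" at letter/digit boundaries stands in for the real parsing.

```r
# Hypothetical vectorized path: no abbreviation handling at all.
parse_fast <- function(x) {
  x <- gsub("([0-9])([A-Za-z])", "\\1_\\2", x)
  gsub("([A-Za-z])([0-9])", "\\1_\\2", x)
}

# Hypothetical per-string path: parse around the first occurrence of the
# abbreviation and leave the abbreviation itself untouched.
parse_protected <- function(x, abbr) {
  m <- regexpr(abbr, x, fixed = TRUE)
  before <- substr(x, 1, m - 1)
  after  <- substr(x, m + nchar(abbr), nchar(x))
  paste0(parse_fast(before), abbr, parse_fast(after))
}

parse_all <- function(string, abbr) {
  has_abbr <- grepl(abbr, string, fixed = TRUE)
  out <- character(length(string))
  # One vectorized call for all entries without the abbreviation ...
  out[!has_abbr] <- parse_fast(string[!has_abbr])
  # ... and a per-element loop only over the entries that contain it.
  out[has_abbr] <- vapply(string[has_abbr], parse_protected, character(1),
                          abbr = abbr, USE.NAMES = FALSE)
  out
}

res <- parse_all(c("t2d_status", "hba1c_level"), abbr = "t2d")
```

Here `res[1]` stays "t2d_status", while `res[2]` goes through the fast path and becomes "hba_1_c_level"; the slower per-string code only runs for inputs that actually contain an abbreviation.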

Tazinho pushed a commit that referenced this issue May 24, 2019
Tazinho pushed a commit that referenced this issue May 25, 2019
@Tazinho Tazinho closed this as completed May 25, 2019