This repository has been archived by the owner on Feb 13, 2020. It is now read-only.

Regular Expressions and R #226

Closed
AdrianLJones opened this issue Nov 5, 2015 · 10 comments

@AdrianLJones

Is there some particularity to the syntax for R that I don't know about?

In the Regex tester I put in \D|\d{3,} with the idea of selecting every non-digit character, plus any run of three or more digits. My plan was to replace those with nothing as a way to clean the Age variable in the candy data set, and this seemed to work in the test bed.

But when I try 'candy_age_time <- candy_age_time %>% mutate(clean.Age = gsub("\D|\d{3,}", '', Age))'
I get Error: '\D' is an unrecognized escape in character string starting ""\D"

What is going wrong?

Thanks,

Adrian

@samhinshaw

@AdrianLJones I was having this same issue. As @jennybc mentioned in class today, in R you have to escape the escape, because the backslash is also the escape character inside R string literals: "\\D|\\d{3,}". See the notes from @ksamuk's lecture, especially "Part 2: Regex in R".
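
A minimal sketch of the double-escaping fix. The `ages` vector is a made-up stand-in for the Age column (an assumption for illustration, not the real candy data), and base `gsub()` is called directly rather than inside `mutate()`.

```r
# Hypothetical stand-in for the Age column; not the real candy data.
ages <- c("35", "forty", "123", "4a")

# In an R string literal, "\\D" is the two characters \ and D,
# which is what the regex engine needs to receive.
pattern <- "\\D|\\d{3,}"

# cat() shows the string as the regex engine sees it.
cat(pattern, "\n")

# Drop every non-digit character and every run of three or more digits.
cleaned <- gsub(pattern, "", ages)
print(cleaned)
```

Printing the pattern with `cat()` rather than `print()` is a quick way to check how many backslashes actually end up in the string.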

@ksamuk

ksamuk commented Nov 6, 2015

@samhinshaw's got it!

@csiu

csiu commented Nov 6, 2015

Along the same line of thought (i.e. the weirdness of regular expressions in R): why does this

grep("^[,\\d]+$", c(letters, "123", "1,234", "mints"), value = TRUE)

return "d" instead of "1,234"?

@ksamuk

ksamuk commented Nov 6, 2015

This is going to be unsatisfying, but I've never been able to get the \w, \d family of escapes to work inside a bracketed character class (like your example, which combines \d and ","). In those cases, you can use [:digit:] or just 0-9 (less typing!) inside the brackets, like:

> grep("^[,0-9]+$", c(letters, "123", "1,234", "mints"), value = TRUE)
[1] "123"   "1,234"
> grep("^[,[:digit:]]+$", c(letters, "123", "1,234", "mints"), value = TRUE)
[1] "123"   "1,234"

I think that problem might be related to the fact that items in square brackets are interpreted literally, e.g. [*] matches a literal asterisk. It would then maybe make sense that in square brackets you would only have to single-escape the \d, but that also doesn't work.

@ksamuk

ksamuk commented Nov 6, 2015

I guess a weird hack that would allow you to use \w, \d is:

> grep("^(\\d|,)+$", c(letters, "123", "1,234", "mints"), value = TRUE)
[1] "123"   "1,234"

But that's getting a bit odd, and would get messy very quickly (each new additional character would need another |).

@kevinushey

I generally find that regular expressions behave much more predictably (i.e., as I would expect them to work) when using perl = TRUE, e.g.

> grep("^[,\\d]+$", c(letters, "123", "1,234", "mints"), value = TRUE, perl = TRUE)
[1] "123"   "1,234"
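
The variants discussed so far can be compared side by side. This is only a sketch that re-runs the patterns already posted in this thread; the variable names are made up for illustration. R's default engine is TRE, while perl = TRUE switches to PCRE.

```r
x <- c(letters, "123", "1,234", "mints")

# Default (TRE) engine: \d inside [ ] is not treated as "any digit".
default_result <- grep("^[,\\d]+$", x, value = TRUE)

# Alternatives that work with the default engine.
range_result <- grep("^[,0-9]+$", x, value = TRUE)
posix_result <- grep("^[,[:digit:]]+$", x, value = TRUE)

# PCRE engine: \d is understood inside a bracketed class.
perl_result <- grep("^[,\\d]+$", x, value = TRUE, perl = TRUE)

print(default_result)
print(range_result)
print(posix_result)
print(perl_result)
```

Only the first call returns "d"; the other three return "123" and "1,234".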

@AdrianLJones
Author

Thanks for the help, I've encountered another problem.
Now I'm trying to clean the candy names. I saw Jenny was able to remove the curly quotes that pepper the names and replace them with regular straight quotes. After cutting and pasting the character into my code (how did she make it in the first place?), I got no effect with new_names3 <- gsub('[’"]','', new_names2)
Is the problem that I'm using gsub instead of str_replace? If so, why? If not, how do I get rid of the curly quotes?

Thanks,

Adrian

@ksamuk

ksamuk commented Nov 11, 2015

The code you posted removes the curly quotes outright (the second argument of gsub you have there is 'nothing'). To replace the quotes with a straight quote, you need to put an escaped straight quote in quotes as the second argument. See below:

> (string <- c("’curly’", "’quotes’"))
[1] "’curly’"  "’quotes’"
> gsub('[’"]','', string)
[1] "curly"  "quotes"
> gsub('[’"]','\'', string)
[1] "'curly'"  "'quotes'"

@AdrianLJones
Author

I'm sorry, I wasn't clear. I did not want to replace the curly quotes with anything, so that was not my confusion. My problem is that after trying to remove curly quotes with the code 'new_names3 <- gsub('[’"]','', new_names2)' there were still curly quotes present and I don't know why.
The quotes were in:
Box’o’ Raisins
Reese’s Peanut Butter Cups
Peanut MM’s

Interestingly your code works just fine at removing your curly quotes, but not the quotes from those three problematic candies.

I think maybe this needs assistance from the office hours.

Thanks

Adrian
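
The thread does not settle why the pasted curly quote failed to match. One possible cause (an assumption, not something confirmed here) is that the pasted character and the character in the data are not the same code point, or that the script file's encoding altered the pasted character. A sketch that sidesteps pasting by writing the quotes as Unicode escapes, and uses utf8ToInt() to inspect which code point a character really is. The `candy` vector is typed by hand from the three names quoted above, assuming their apostrophe is U+2019.

```r
# Candy names from the thread, with the apostrophe written as U+2019
# (an assumption about what the real data contains).
candy <- c("Box\u2019o\u2019 Raisins",
           "Reese\u2019s Peanut Butter Cups",
           "Peanut MM\u2019s")

# Inspect the code point of a suspicious character (8217 is U+2019).
print(utf8ToInt("\u2019"))

# Remove left/right single and double curly quotes, plus straight double quotes.
curly <- "[\u2018\u2019\u201C\u201D\"]"
cleaned <- gsub(curly, "", candy)
print(cleaned)
```

Running utf8ToInt() on one of the problematic names from the real data would show whether its quote is one of these four code points or something else entirely.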

@jennybc jennybc closed this as completed Aug 30, 2016
@Vamshi-dhar

@csiu
You can try this code to get "123" and "1,234":
grep("\\d+$", c(letters, "123", "1,234", "mints"), value = TRUE)
With R's default regex engine (i.e. without perl = TRUE) you can't use \d within [ ]; it is read as a literal "d" instead, which is why the result is "d".
Hope it helps.
Cheers!
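
One caveat worth illustrating: "\\d+$" only requires the string to end in digits, whereas the anchored bracket patterns earlier in the thread require the whole string to be digits and commas. A small sketch of the difference, where "abc9" is a made-up extra value added for illustration:

```r
# "abc9" is an added, hypothetical value; the rest is the thread's test vector.
x <- c(letters, "123", "1,234", "mints", "abc9")

# Matches anything that ends in at least one digit.
end_only <- grep("\\d+$", x, value = TRUE)

# Matches only strings made up entirely of digits and commas.
whole <- grep("^[,0-9]+$", x, value = TRUE)

print(end_only)
print(whole)
```

For the original test vector both patterns give the same answer, so the looser pattern is fine there; the difference only shows up on mixed strings such as "abc9".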
