Regular Expressions and R #226

AdrianLJones · 2015-11-05T19:44:04Z

Is there some particularity to the syntax for R that I don't know about?

In the regex tester I entered \D|\d{3,}, intending to match any non-digit character or any run of three or more digits. My plan was to replace those matches with an empty string as a way to clean the Age variable in the candy data set, and this seemed to work in the tester.

But when I try 'candy_age_time <- candy_age_time %>% mutate(clean.Age = gsub("\D|\d{3,}", '', Age))'
I get Error: '\D' is an unrecognized escape in character string starting ""\D"

What is going wrong?

Thanks,

Adrian

samhinshaw · 2015-11-05T19:50:32Z

@AdrianLJones I was having this same issue. As @jennybc mentioned in class today, in R you have to escape the escape: "\\D|\\d{3,}". See the notes from @ksamuk's lecture, especially "Part 2: Regex in R".
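To illustrate why the doubled backslash is needed, here is a minimal sketch. The `ages` vector is made up for illustration and is not taken from the candy data. R processes backslash escapes in the string literal first, so "\\D" has already become the two-character string \D by the time the regex engine sees it. The raw-string form at the end assumes R 4.0.0 or later.

```r
# Hypothetical sample values, not from the actual candy data set
ages <- c("35", "45ish", "1234", "old")

# In source code "\\D" is the two-character string \D
pattern <- "\\D|\\d{3,}"
cat(pattern, "\n")   # shows the pattern as the regex engine receives it
nchar(pattern)       # 10 characters, not 12

# Remove non-digits and runs of three or more digits
cleaned <- gsub(pattern, "", ages)
cleaned

# Assumes R >= 4.0.0: a raw string avoids the doubling altogether
raw_pattern <- r"(\D|\d{3,})"
identical(pattern, raw_pattern)
```

With these made-up inputs, "45ish" becomes "45", while "1234" and "old" become empty strings.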

ksamuk · 2015-11-06T00:57:25Z

@samhinshaw's got it!

csiu · 2015-11-06T05:45:05Z

Along the same line of thought -- i.e. weirdness of "Regular Expression and R" -- why does this

grep("^[,\\d]+$", c(letters, "123", "1,234", "mints"), value = TRUE)

return "d" instead of "123" and "1,234"?

ksamuk · 2015-11-06T06:33:08Z

This is going to be unsatisfying, but I've never been able to get the \w \d family of character classes to work as part of another bracketed character class (like your example which is \d and ","). In those cases, you can use [:digit:] or just 0-9 (less typing!) like:

> grep("^[,0-9]+$", c(letters, "123", "1,234", "mints"), value = TRUE)
[1] "123"   "1,234"
> grep("^[,[:digit:]]+$", c(letters, "123", "1,234", "mints"), value = TRUE)
[1] "123"   "1,234"

I think the problem might be related to the fact that most items in square brackets are interpreted literally, e.g. [*] matches a literal asterisk. It would then maybe make sense that inside square brackets you would only have to single-escape the \d, but that doesn't work either.
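A small sketch of what the default engine does with the bracketed pattern, using the same input vector as in the question. It only demonstrates the observed behaviour (the d ends up as an ordinary member of the class); it is not a full account of how the engine parses brackets.

```r
x <- c(letters, "123", "1,234", "mints")

# Default engine: inside [ ] the d is treated as an ordinary letter,
# so the class does not contain the digits
default_hits <- grep("^[,\\d]+$", x, value = TRUE)
default_hits

# A POSIX class inside the brackets behaves as intended
posix_hits <- grep("^[,[:digit:]]+$", x, value = TRUE)
posix_hits

# Most metacharacters are literal inside brackets
star_hits <- grepl("[*]", c("a*b", "ab"))
star_hits
```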

ksamuk · 2015-11-06T06:37:27Z

I guess a weird hack that would allow you to use \w, \d is:

> grep("^(\\d|,)+$", c(letters, "123", "1,234", "mints"), value = TRUE)
[1] "123"   "1,234"

But that's getting a bit odd, and would get messy very quickly (each new additional character would need another |).

kevinushey · 2015-11-07T01:35:31Z

I generally find that regular expressions behave much more predictably (i.e., as I would expect them to work) when using perl = TRUE, e.g.

> grep("^[,\\d]+$", c(letters, "123", "1,234", "mints"), value = TRUE, perl = TRUE)
[1] "123"   "1,234"
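The same argument is accepted by gsub, so a bracketed class with \d can also be used for cleaning. A minimal sketch with made-up values (the `ages` vector is hypothetical):

```r
# Hypothetical values for illustration only
ages <- c("35", "1,234", "45ish")

# With perl = TRUE, \d is recognised inside the brackets;
# [^,\\d] means "anything that is not a comma or a digit"
kept <- gsub("[^,\\d]", "", ages, perl = TRUE)
kept
```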

AdrianLJones · 2015-11-11T00:41:53Z

Thanks for the help, I've encountered another problem.
Now I'm trying to clean the candy names. I saw Jenny was able to remove the curly quotes that pepper the names and replace them with regular straight quotes. After cutting and pasting the character into my code (how did she make it in the first place?), I got no effect with new_names3 <- gsub('[’"]','', new_names2)
Is the problem that I'm using gsub instead of str_replace? If so, why? If not, how do I get rid of the curly quotes?

Thanks,

Adrian

ksamuk · 2015-11-11T01:42:14Z

The code you posted removes the curly quotes outright (the second argument of gsub you have there is 'nothing'). To replace the quotes with a straight quote, you need to put an escaped straight quote in quotes as the second argument. See below:

> (string <- c("’curly’", "’quotes’"))
[1] "’curly’"  "’quotes’"
> gsub('[’"]','', string)
[1] "curly"  "quotes"
> gsub('[’"]','\'', string)
[1] "'curly'"  "'quotes'"

AdrianLJones · 2015-11-13T18:11:31Z

I'm sorry, I wasn't clear. I did not want to replace the curly quotes with anything, so that was not my confusion. My problem is that after trying to remove curly quotes with the code 'new_names3 <- gsub('[’"]','', new_names2)' there were still curly quotes present and I don't know why.
The quotes were in:
Box’o’ Raisins
Reese’s Peanut Butter Cups
Peanut MM’s

Interestingly, your code works just fine at removing your curly quotes, but not the quotes from those three problematic candies.

I think maybe this needs some help at office hours.

Thanks

Adrian
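One possible explanation, offered as a guess since the thread never confirms it: the quotes in those names may not be the same code point as the one pasted into the pattern (for example a left single quote U+2018 rather than a right single quote U+2019), or the file may have been read with a different encoding. The sketch below checks a code point and removes the common curly quotes via Unicode escapes. The `new_names2` vector here is a made-up stand-in for the real one, and a UTF-8 session is assumed.

```r
# Hypothetical stand-in for the real names, built with explicit code points
new_names2 <- c("Box\u2019o\u2019 Raisins",
                "Reese\u2019s Peanut Butter Cups",
                "Peanut M\u2018M\u2019s")

# Inspect the suspicious character: 8217 is U+2019, 8216 is U+2018
code_point <- utf8ToInt(substr(new_names2[2], 6, 6))
code_point

# Spell out left/right single and double curly quotes by escape,
# rather than pasting them into the source file
curly <- "[\u2018\u2019\u201C\u201D]"
new_names3 <- gsub(curly, "", new_names2)
new_names3
```

If utf8ToInt reports something other than 8216 or 8217 for the real data, that code point would need to be added to the class.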

Vamshi-dhar · 2017-07-24T23:07:53Z

@csiu
You can try this code to get "123" and "1,234":
grep("\\d+$", c(letters, "123", "1,234", "mints"), value = TRUE)
You can't use \d within [ ] because R then looks for a literal "d" instead, which is why the result is "d".
Hope it helps.
Cheers!
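One caveat worth checking, sketched here with an extra made-up element ("abc1") added to the vector: a pattern anchored only at the end accepts any string that merely ends in a digit, so it is looser than the bracketed versions.

```r
# "abc1" is a hypothetical extra element added for the comparison
x <- c(letters, "123", "1,234", "mints", "abc1")

# Only requires the string to end in one or more digits
loose <- grep("\\d+$", x, value = TRUE)
loose

# Requires the whole string to consist of commas and digits
strict <- grep("^[,0-9]+$", x, value = TRUE)
strict
```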

jennybc closed this as completed Aug 30, 2016
