# Regular Expressions

To demonstrate regular expression operations, we’re going to use the simple data frame below. As you can see, it contains a few names and email addresses from different domains. Suppose our goal is to perform data analysis on each of the domains in the email addresses.

```R
data_set <- data.frame(
    name = c("John", "Jane", "Mary", "Henry", "Jack", "Jill"),
    email = c("johnj@example.com", "janej@example.com", "marym@example.com", "henryh@sample.com", "jackj@gmail.com", "jillj@sample.com")
)
```

The problem is that the addresses use different domains, so the part after the “at sign” varies from one address to the next.

So what we need to do is isolate all the characters between the “at sign” and the period. This seems tricky at first, since the domains can have variable lengths or unusual characters, which is why regular expressions are a good fit for a task like this. Regular expressions are used to match patterns in strings and text.

Some of the most common regular expression operators are:

- `.` - Matches any single character except newline
- `^` - Anchors a match at the start of a string
- `$` - Anchors a match at the end of a string
- `*` - Matches zero or more repetitions
- `+` - Matches one or more repetitions
- `?` - Matches zero or one repetitions
- `|` - Matches either/or, e.g. `x|y` matches either x or y
- `{}` - Matches the specified number of repetitions
- `[]` - Matches any one character from a set or range, e.g. `[A-Z]` matches any single letter from A to Z
- `()` - Creates a capture group and indicates precedence
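As a small illustration of the operators listed above, the sketch below applies a few of them to the same email addresses. It uses `grepl()`, and the variable names are chosen only for this example. Note that the dot is escaped as `\\.` inside the R string so that it matches a literal period.

```R
# Same addresses as data_set$email, repeated so this block runs on its own
emails <- c("johnj@example.com", "janej@example.com", "marym@example.com",
            "henryh@sample.com", "jackj@gmail.com", "jillj@sample.com")

# ^ anchors at the start: addresses that begin with "j"
starts_with_j <- grepl("^j", emails)
# TRUE TRUE FALSE FALSE TRUE TRUE

# $ anchors at the end: addresses that end in ".com"
ends_with_com <- grepl("\\.com$", emails)
# TRUE for all six addresses

# | is alternation and () groups the alternatives
sample_or_gmail <- grepl("@(sample|gmail)\\.", emails)
# FALSE FALSE FALSE TRUE TRUE TRUE

# [] is a character class and {} a repetition count:
# exactly five lowercase letters before the "@"
five_letter_user <- grepl("^[a-z]{5}@", emails)
# TRUE TRUE TRUE FALSE TRUE TRUE
```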

Some of the most common regular expression functions are:

- `grep()` - Returns the indices of the elements that contain a match
- `grepl()` - Returns TRUE or FALSE for each element, depending on whether it contains a match
- `regexpr()` - Returns the position of the first match in each element, along with the length of the match
- `gregexpr()` - Returns a list with the positions of all matches in each element, along with the length of each match
- `sub()` - Replaces the first match in each element
- `gsub()` - Replaces all matches in each element
- `regexec()` - Like `regexpr()`, but also returns the positions of any capture groups
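The position-based functions are easier to follow with a concrete value. The sketch below runs `regexpr()` and `regexec()` on one address from the data set. It also uses `regmatches()`, a base R helper that is not in the list above, to turn the reported positions back into text. The variable names are illustrative only.

```R
email <- "henryh@sample.com"

# regexpr(): start position and length of the first match
m <- regexpr("@[a-z]+", email)
first_pos <- as.integer(m)            # 7, the position of "@"
first_len <- attr(m, "match.length")  # 7, the length of "@sample"

# regmatches() converts the position/length pair into the matched text
first_text <- regmatches(email, m)    # "@sample"

# regexec(): like regexpr(), but also reports each capture group
parts <- regmatches(email, regexec("^(.+)@(.+)\\.(.+)$", email))[[1]]
# "henryh@sample.com" "henryh" "sample" "com"
```

The first element of `parts` is the whole match, and the remaining elements are the capture groups in order.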

Some of the most common regular expression modifiers are:

- `i` - Makes the match case insensitive
- `m` - Enables multiline mode. `^` and `$` match at the start and end of each line rather than only at the start and end of the whole string
- `s` - Enables single-line (dotall) mode. The dot (.) also matches the newline character
- `x` - Enables extended mode. Whitespace in the pattern is ignored
- `U` - Ungreedy mode. Inverts greediness of quantifiers so that they are not greedy by default
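In R, these modifiers are not passed as separate flags. One way to apply them, sketched below, is to write them inline as `(?i)`, `(?m)`, or `(?s)` at the start of the pattern together with `perl = TRUE`. For case insensitivity there is also the `ignore.case` argument. The input strings here are made up for the example.

```R
# Hypothetical inputs that differ only in letter case
x <- c("JohnJ@Example.com", "johnj@example.com")

case_sensitive <- grepl("example", x)                      # FALSE TRUE
via_argument   <- grepl("example", x, ignore.case = TRUE)  # TRUE TRUE
via_inline     <- grepl("(?i)example", x, perl = TRUE)     # TRUE TRUE

# A string containing a newline, to show the m and s modifiers
two_lines <- "first\nsecond"

# Without (?m), ^ only matches at the very start of the string
caret_default   <- grepl("^second", two_lines, perl = TRUE)      # FALSE
caret_multiline <- grepl("(?m)^second", two_lines, perl = TRUE)  # TRUE

# Without (?s), the dot does not match the newline
dot_default <- grepl("first.second", two_lines, perl = TRUE)      # FALSE
dot_all     <- grepl("(?s)first.second", two_lines, perl = TRUE)  # TRUE
```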

Some of the most common regular expression metacharacters are:

- `\d` - Matches any decimal digit. Equivalent to [0-9]
- `\D` - Matches any character that is not a decimal digit. Equivalent to [^0-9]
- `\s` - Matches any whitespace character
- `\S` - Matches any non-whitespace character
- `\w` - Matches any word character (letter, digit, or underscore). Equivalent to [a-zA-Z0-9_]
- `\W` - Matches any non-word character. Equivalent to [^a-zA-Z0-9_]
- `\b` - Matches a word boundary (a position, not a character)
- `\B` - Matches any position that is not a word boundary
- `\A` - Matches the start of a string
- `\Z` - Matches the end of a string
- `\t` - Matches a tab character
- `\n` - Matches a new line character
- `\r` - Matches a carriage return character
- `\f` - Matches a form feed character
- `\v` - Matches a vertical tab character
- `\o` - Matches a character given by octal notation \o{777}
- `\x` - Matches a character given by hexadecimal notation \x{BB}
- `\u` - Matches a character given by Unicode character notation \u{FFFF}
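One practical detail when using these escapes in R is that the backslash is also the escape character of R string literals, so it has to be doubled: the regex `\d` is written as `"\\d"` in code. The sketch below shows this on a few made-up strings. It passes `perl = TRUE` so that the patterns are interpreted with PCRE syntax, which is an assumption made for this example rather than something the list above requires.

```R
# Hypothetical inputs; the third one contains a tab character
x <- c("order 66", "no digits here", "tab\there")

# The R string "\\d" holds two characters: a backslash and a "d"
pattern_length <- nchar("\\d")                        # 2

has_digit   <- grepl("\\d", x, perl = TRUE)           # TRUE FALSE FALSE
underscored <- gsub("\\s+", "_", x, perl = TRUE)
# "order_66" "no_digits_here" "tab_here"

# \b requires "here" to stand as a whole word
has_word_here <- grepl("\\bhere\\b", x, perl = TRUE)  # FALSE TRUE TRUE
```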

Let's try some of these out on our data set. First, let's use the `grep()` function to find all the email addresses that contain the word "sample".

```R
grep("sample", data_set$email)

# [1] 4 6
```

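As you can see, the function returns the indices of all matching elements, not just the first one. In this case, the email addresses at index 4 and 6 contain the word "sample". (Passing `value = TRUE` to `grep()` returns the matching addresses themselves instead of their indices.) If we want a TRUE or FALSE answer for every address, we can use the `grepl()` function.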

```R
grepl("sample", data_set$email)

# [1] FALSE FALSE FALSE  TRUE FALSE  TRUE
```

This function returns a logical value for each email address: TRUE if the address contains the word "sample", and FALSE otherwise. We can also use the `sub()` function to replace the word "sample" with "example".

```R
sub("sample", "example", data_set$email)

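# [1] "johnj@example.com"
# [2] "janej@example.com"
# [3] "marym@example.com"
# [4] "henryh@example.com"
# [5] "jackj@gmail.com"
# [6] "jillj@example.com"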
```

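As you can see, the function replaces the first match in each address with the word "example". If we want to replace every match within a string, we can use the `gsub()` function. Because each address here contains "sample" at most once, the result is the same.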

```R
gsub("sample", "example", data_set$email)

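# [1] "johnj@example.com"
# [2] "janej@example.com"
# [3] "marym@example.com"
# [4] "henryh@example.com"
# [5] "jackj@gmail.com"
# [6] "jillj@example.com"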
```
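The difference between the two functions only shows up when a string contains the pattern more than once. A minimal sketch, using a made-up address that is not part of the data set:

```R
# Hypothetical address in which "sample" occurs twice
addr <- "sample.user@sample.com"

first_only <- sub("sample", "example", addr)   # "example.user@sample.com"
all_of_them <- gsub("sample", "example", addr) # "example.user@example.com"
```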

Now let's try to extract the domain names from the email addresses. A first step is to locate them with the `gregexpr()` function.

```R
gregexpr("@.*\\.", data_set$email)

# [[1]]
# [1] 6
# attr(,"match.length")
# [1] 9
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# ...
```

As you can see, the function returns a list with one entry per email address. Each entry holds the starting position of every match (here, the position of the "@" symbol), and its "match.length" attribute holds the length of each match, which runs through the last "." in the address. We can use these values to extract the domain names.

```R
gregexpr("@.*\\.", data_set$email)[[1]][1]

# [1] 6
```
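To go from match positions to the domain names themselves, one possible approach is sketched below. It uses `regmatches()` from base R to pull out the matched text and then trims the surrounding "@" and "."; an alternative with a capture group in `sub()` is shown as well. The variable names are illustrative, and both variants assume addresses with a single "." after the "@" (for something like `user@mail.example.co.uk` the greedy `.*` would keep everything up to the last period).

```R
# Same addresses as data_set$email, repeated so this block runs on its own
emails <- c("johnj@example.com", "janej@example.com", "marym@example.com",
            "henryh@sample.com", "jackj@gmail.com", "jillj@sample.com")

# Variant 1: locate the match, extract it, then drop the "@" and the "."
m <- regexpr("@.*\\.", emails)
matched <- regmatches(emails, m)                   # "@example." "@example." ...
domains <- substr(matched, 2, nchar(matched) - 1)
# "example" "example" "example" "sample" "gmail" "sample"

# Variant 2: keep only the capture group between "@" and the last "."
domains_sub <- sub("^.*@(.*)\\..*$", "\\1", emails)

# Count how often each domain occurs
domain_counts <- table(domains)
```

With the domains isolated, `domain_counts` gives a starting point for the per-domain analysis described at the beginning.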

Regular expressions can be a bit tricky at first, but once you get the hang of them, they are a very powerful tool for data analysis. They can be used for tasks such as:
- Data extraction
- Data cleaning
- Data validation
- Data transformation
- Text mining
- Data wrangling
- Parsing
- Data analysis