#### gsub() is an R function whose name stands for "global substitution"

You can think of it as the R version of "Find/Replace" in Excel. Let's look at an example:

![][1]

[1]: https://ist387.s3.us-east-2.amazonaws.com/images/Picture1.png "DS"

#### What would be the outcome in the "users" column of this spreadsheet if you press "Replace All"?

Correct - because the "Replace with:" box is empty, "@syr.edu" will have been replaced with... nothing, so the "users" column will now contain just the usernames of these 3 students. As you can see, nothing can sometimes actually do something!

**users** <br>
alp <br>
sh67 <br>
tans <br>

Now let's try to achieve the same result in R! We will start out by replicating the data from the 3 Excel columns into 3 R vectors:

In [5]:
names <- c("Alex", "Shu", "Tanya")
ages <- c(28, 17, 35)
users <- c("alp@syr.edu", "sh67@syr.edu", "tans@syr.edu")

Let's check if it worked by printing (that is, calling) the 3 vectors:

In [7]:
names
ages
users

It looks like they're all there! Time to merge them into a dataframe with 3 rows and 3 columns to replicate what we see in the spreadsheet screenshot above:

In [8]:
students <- data.frame(names, ages, users)

It ran without error. Let's print it to see if it looks the way it's supposed to:

In [9]:
students

names,ages,users
<chr>,<dbl>,<chr>
Alex,28,alp@syr.edu
Shu,17,sh67@syr.edu
Tanya,35,tans@syr.edu
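
If you'd rather confirm the "3 rows and 3 columns" claim without eyeballing the printout, here is a minimal sketch using the base R helpers dim(), str() and sapply(). It rebuilds the same three vectors so it can run on its own; nothing here is needed for the rest of the walkthrough.

```r
# Rebuild the three vectors from above so this block runs on its own
names <- c("Alex", "Shu", "Tanya")
ages <- c(28, 17, 35)
users <- c("alp@syr.edu", "sh67@syr.edu", "tans@syr.edu")
students <- data.frame(names, ages, users)

dim(students)            # number of rows, then number of columns
str(students)            # compact summary: one line per column with its type
sapply(students, class)  # the class of each column as a named vector
```

One thing to keep in mind: the <chr> labels in the printout above mean the text columns were kept as character. In R versions before 4.0, data.frame() converted text columns to factors by default, so the reported classes may differ on an older installation.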


All good! We are now ready to use the gsub() function to clean the "users" column so we are only left with the username of each student, i.e. get rid of the "@syr.edu" suffix:

In [10]:
students$usernames <- gsub("@syr.edu", "", students$users)

Did we overwrite the "users" column or did we actually do something else? Let's see by printing the students df again:

In [11]:
students

names,ages,users,usernames
<chr>,<dbl>,<chr>,<chr>
Alex,28,alp@syr.edu,alp
Shu,17,sh67@syr.edu,sh67
Tanya,35,tans@syr.edu,tans


As you can see, we didn't overwrite it. Instead, we created a new column called "usernames". How would you overwrite "users" instead? Hint: in cell 10, "usernames" should instead be "users". <br>

What is the syntax of gsub() in cell 10? That is, what arguments does it take to work? Let's break down the stuff in parentheses. The first argument, "@syr.edu", is the text we want gsub() to find, just like the characters you would type into the Excel Find box. It's wrapped in quotes because it is a string of text (= chr). <br>

The second argument, "", is what we want to replace the found string with. In our case, just like we left the Replace field blank in Excel, we are telling R we want to replace "@syr.edu" with... NOTHING, i.e. an empty string. <br>

Finally, the third argument, students$users, tells gsub() which df column exactly to perform this Find/Replace function on. In our case we want it to focus on the "users" column, and not, for example, "names" or "ages."
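
To make the three arguments easier to tell apart, here is a small sketch that spells them out by name (pattern, replacement and x are the argument names gsub() uses). It also shows one thing worth knowing: by default gsub() reads the pattern as a regular expression, in which "." stands for any single character. The address "someone@syrXedu" is made up purely for illustration.

```r
users <- c("alp@syr.edu", "sh67@syr.edu", "tans@syr.edu")

# Same call as in the walkthrough, with the three arguments named
gsub(pattern = "@syr.edu", replacement = "", x = users)

# By default "." matches any single character, so the pattern
# "@syr.edu" also matches the made-up text "@syrXedu"
gsub("@syr.edu", "", "someone@syrXedu")

# fixed = TRUE treats the pattern as literal text, so only a real
# "@syr.edu" is removed
gsub("@syr.edu", "", "someone@syrXedu", fixed = TRUE)
gsub("@syr.edu", "", users, fixed = TRUE)
```

For our three students both versions give the same usernames; fixed = TRUE just guards against accidental matches when the text you search for contains characters like "." that have a special meaning in regular expressions.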

#### gsub() shortcuts you may find useful

In Excel, you need to type explicitly the character or symbol you want to find and/or replace. R, on the other hand, can search either for explicit text like that or for more general patterns. <br>

The following code will find the expression "Good" and remove it from the word "Goodbye":

In [12]:
gsub("Good", "", "Goodbye")

You can see if the function is case-sensitive by trying the same syntax on "goodbye" instead:

In [13]:
gsub("Good", "", "goodbye")
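
Once you've run the cell above and seen what happens, you may wonder how to match a word no matter how it is capitalized. A minimal sketch, using gsub()'s ignore.case argument:

```r
# Default behaviour: upper and lower case are treated as different
gsub("Good", "", "goodbye")

# ignore.case = TRUE matches "Good", "good", "GOOD" and so on
gsub("Good", "", "goodbye", ignore.case = TRUE)
```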

Now let's try it with a more general search. What if I wanted to find and replace all the numbers in a sentence without explicitly putting in every single number separately? Let's take the following sentence as an example:

In [16]:
sentence <- "I am 29 years old, and weigh 130 lbs"

I could do something like this:

In [17]:
gsub("29", "X", sentence)

But if I wanted to identify and replace both numbers (or any number of numbers for that matter) in one go, I can do something like this:

In [19]:
gsub("\\d+", "X", sentence)

"\\d+" will match any run of one or more digits, i.e. any whole number. "d" stands for digit, and "+" means "one or more".

What does this one do:

In [20]:
gsub("\\w+", "X", sentence)

How about this:

In [21]:
gsub("\\s", "X", sentence)
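
If you're curious how much work the little "+" is doing, here is a short sketch that compares a few variations on the digit pattern. It also uses sub(), gsub()'s sibling in base R, which stops after the first match.

```r
sentence <- "I am 29 years old, and weigh 130 lbs"

# One X per run of digits
gsub("\\d+", "X", sentence)

# Without "+", every single digit gets its own X
gsub("\\d", "X", sentence)

# "[0-9]+" is another way to write "one or more digits"
gsub("[0-9]+", "X", sentence)

# sub() replaces only the first match and leaves the rest alone
sub("\\d+", "X", sentence)
```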

You can learn more about this cool feature, called "regular expressions", here: http://www.endmemo.com/r/gsub.php

Search pattern helpers like regular expressions are there to make our work easier. **Functions** have a similar purpose in coding - they automate tasks, so instead of running the same lines of code over and over for things we do often, we can write the code once, wrap it in a function, and just call the function the next time we need to accomplish a similar task.

Let's add another column to our "students" df:

In [22]:
students$gpa <- c("gpa_4.3", "gpa_3.6", "gpa_2.9")

In [23]:
students

names,ages,users,usernames,gpa
<chr>,<dbl>,<chr>,<chr>,<chr>
Alex,28,alp@syr.edu,alp,gpa_4.3
Shu,17,sh67@syr.edu,sh67,gpa_3.6
Tanya,35,tans@syr.edu,tans,gpa_2.9


Our next task is to calculate the average GPA of this 3-person class. Let's break down this task into steps: <br>
1. We need to get to the numbers in the "gpa" column by getting rid of the "gpa_" prefix. Luckily, we know a function that can help us do just that:

In [24]:
students$gpa <- gsub("gpa_", "", students$gpa)

In [25]:
students$gpa

Success! Can we then just use vector math on the "gpa" variable and get our answer? Let's try:

In [26]:
mean(students$gpa)

“argument is not numeric or logical: returning NA”


Hmm, R doesn't seem to like our simple command, and tells us that the argument we provided to the mean() function is not numeric. We do see numbers though, what's going on? We can quickly check the df for more information:

In [27]:
students

names,ages,users,usernames,gpa
<chr>,<dbl>,<chr>,<chr>,<chr>
Alex,28,alp@syr.edu,alp,4.3
Shu,17,sh67@syr.edu,sh67,3.6
Tanya,35,tans@syr.edu,tans,2.9


I see now - the "gpa" var is still coded as <chr>, i.e. character, not a number. Step 2 in our process then should be to convert this variable into numeric:

In [28]:
students$gpa <- as.numeric(students$gpa)
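
A quick way to check a type without printing the whole df is class(). Here is a minimal sketch on a stand-alone vector (gpa_text and gpa_number are names made up for this example), which also shows what as.numeric() does when the text is not a clean number:

```r
gpa_text <- c("4.3", "3.6", "2.9")
class(gpa_text)      # these look like numbers but are stored as text

gpa_number <- as.numeric(gpa_text)
class(gpa_number)    # now R treats them as numbers

# Text that is not a clean number turns into NA, and R gives a warning.
# This is why the "gpa_" part has to be removed before converting.
as.numeric(c("4.3", "gpa_3.6"))
```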

In [29]:
students

names,ages,users,usernames,gpa
<chr>,<dbl>,<chr>,<chr>,<dbl>
Alex,28,alp@syr.edu,alp,4.3
Shu,17,sh67@syr.edu,sh67,3.6
Tanya,35,tans@syr.edu,tans,2.9


"dbl" stands for "double" - a numeric data type. Let's try the mean() function again:

In [30]:
mean(students$gpa)

Woohoo, it worked! Essentially, we needed to execute the following steps of code: <br>
1. Clean the "gpa" variable using **gsub()** <br>
2. Convert the cleaned "gpa" variable from chr to a numeric type using **as.numeric()** <br>
3. Calculate the average using **mean()**
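
The three steps above can be sketched on a stand-alone vector like this. The names raw_gpa, cleaned and numbers are made up for the example; the last line shows the same steps nested into a single call, which is one possible way to shorten things, not the only one.

```r
raw_gpa <- c("gpa_4.3", "gpa_3.6", "gpa_2.9")

cleaned <- gsub("gpa_", "", raw_gpa)   # step 1: remove the "gpa_" text
numbers <- as.numeric(cleaned)         # step 2: convert text to numbers
mean(numbers)                          # step 3: calculate the average

# The same three steps, nested from the inside out
mean(as.numeric(gsub("gpa_", "", raw_gpa)))
```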

Imagine the following situation: your boss asks you to perform this task on 90 different classes of students. And only gives you an hour! If my math is correct, that means running 90 x 3 = 270 lines of code (and that's not even counting the code you need to read in all this data). There must be an easier way! <br>
Indeed, there is. We can create a function called **averageGPAcalculator()** and automate this process. Here is how it works: <br>
We need to think of a name for our function first. I already suggested one, let's use it:

In [33]:
averageGPAcalculator <- function(input){
    step1 <- "do something here"
    step2 <- "do something else"
    step3 <- "maybe even some more"
    return(step3)
}

If you click "Run", you won't see any output. This is because these lines of code only define the function and store it under its name. To use it, we need to actually "call" it on something. We need an input. Let's try it on a text string:

In [35]:
averageGPAcalculator("what can I do with this?")

The output is 'maybe even some more', how so? Let's discuss the different elements of this function. <br>
1. We declared it by giving it a name, averageGPAcalculator. <br>
2. We told R that it is a **function** and nothing else, <br>
3. ...and told it that our function will take a single argument which we called "input" - you could have used any other word and it would still work because it's nothing more than a placeholder. <br>
4. Inside the curly braces, {}, is where the body of the function is, i.e. the stuff we want it to do. In this case, we have 3 steps, but you could really have as many or as few as you want. <br>
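5. When we got done with the last step, we told R we want it to output the result of the last step with the return() function. Had this return() line not been there, the call would have printed nothing: an R function hands back the value of its last expression, but when that expression is an assignment like step3 <- ..., the value is returned invisibly, so we wouldn't have seen the fruit of our labor.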
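
To see what return() does and doesn't change, here is a small sketch with three toy functions. Their names (with_return, last_expression, assignment_last) are made up for this illustration and are not part of the walkthrough.

```r
# Explicit return(): the value is handed back and printed at the console
with_return <- function(input) {
    result <- toupper(input)
    return(result)
}

# No return(): R hands back the value of the last expression it evaluated
last_expression <- function(input) {
    toupper(input)
}

# Last expression is an assignment: the value is still handed back,
# but invisibly, so nothing is printed unless you store or print it
assignment_last <- function(input) {
    result <- toupper(input)
}

with_return("hello")
last_expression("hello")
assignment_last("hello")            # prints nothing
saved <- assignment_last("hello")   # the value is there all the same
saved
```

Writing return() explicitly, as we did above, is a reasonable habit while you are learning, because it makes it obvious which value the function gives back.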

This function is not very useful though. Whatever input, aka argument, we provide, it will always return the exact same thing, "maybe even some more." <br>
To make it perform the task we set out to figure out - actually calculating the average GPA of a class, we'd need to modify the steps in the body of the function:

In [36]:
averageGPAcalculator <- function(input){
    step1 <- gsub("gpa_", "", input$gpa)
    step2 <- as.numeric(step1)
    step3 <- mean(step2)
    return(step3)
}

What we did was essentially take the separate commands we executed earlier and bundle them together into a function. Let's test it on our **students** df. If the function runs correctly, we should get the same value for the average GPA as in the single steps we performed earlier - 3.6:

In [37]:
averageGPAcalculator(students)

Great, it works! Now we know that we can tackle any df that has a "gpa" column in just one line of code! Let's create another df and try it again:

In [39]:
name <- c("Bree", "Grace", "Akshat", "Penny", "Leon")
gpa <- c(2.3, 4.0, 3.9, 3.5, 2.8)
anotherGroupOfStudents <- data.frame(name, gpa)

In [40]:
averageGPAcalculator(anotherGroupOfStudents)
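
Coming back to the boss with 90 classes: one possible way to run the function on many dfs at once is to put them in a list and let sapply() call the function on each element. This is only a sketch with two small classes; the names classA, classB and classes are made up for the example, and the function is repeated so the block runs on its own.

```r
# Same function as above, repeated so this block is self-contained
averageGPAcalculator <- function(input){
    step1 <- gsub("gpa_", "", input$gpa)
    step2 <- as.numeric(step1)
    step3 <- mean(step2)
    return(step3)
}

# Two example classes: one with the "gpa_" text, one with plain numbers
classA <- data.frame(name = c("Alex", "Shu", "Tanya"),
                     gpa = c("gpa_4.3", "gpa_3.6", "gpa_2.9"))
classB <- data.frame(name = c("Bree", "Grace", "Akshat", "Penny", "Leon"),
                     gpa = c(2.3, 4.0, 3.9, 3.5, 2.8))

# Put the dfs in a named list and apply the function to each one
classes <- list(classA = classA, classB = classB)
averages <- sapply(classes, averageGPAcalculator)
averages
```

The result is a named vector with one average per class, so 90 classes would still take a single sapply() call once the dfs are in a list.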

You can practice on your own by creating new dfs or tweaking the body of the function;)