<hr style="height:0px; visibility:hidden;" />

<h1><center>3. R introduction</center></h1>

<div class="alert alert-block alert-success">
There are many different ways to process amplicon data, but the primary program we are going to use is built in R. So here we are going to develop a little foundation for working in R. Just like with Unix, none of this is about memorization. R in particular can become a little messy looking if we aren't yet used to it, so don't worry if things seem a little confusing. This is just about exposure for now 🙂
</div>

---

<center>This is notebook 3 of 6 of <a href="00-overview.ipynb">GL4U's Amplicon Bootcamp</a>. It is expected that the previous notebooks have been completed already.</center>

---

[**Previous:** 2. Unix intro](02-unix-intro.ipynb)
<br>

<div style="text-align: right"><a href="04-setup-QC.ipynb"><b>Next:</b> 4. Setup and QC</a></div>

---
---

# Table of Contents

* [0. What is R?](#zeroR)
* [1. Introduction to programming in R](#oneR)
    * [1a. Installing and loading packages/libraries in R](#lib)
    * [1b. Defining locations and assigning variables in R](#variable)
    * [1c. Loading and viewing a data file in R](#file)
        * [help()](#help)
        * [read.table() and read.csv()](#read)
        * [head()](#headR)
        * [dim()](#dim)
        * [summary()](#summary)
    * [1d. Data frame manipulations](#df)
        * [Add a value to all cells](#dfadd)
        * [Take the log of all cells](#dflog)
        * [Convert data frame to contain only integers](#dfint)
        * [Slice a data frame column](#dfcol)
        * [Slice a data frame row](#dfrow)
        * [Filter data in a data frame](#dffilter)
        * [Add columns to a data frame](#dfmore)
        * [Combine data frames](#dfcombine)
    * [1e. Export data from R](#export)
    * [1f. Visualizations](#viz)

<br>

---
---

<a class="anchor" id="zeroR"></a>

# 0. What is R?

R is a programming language designed specifically for statistical analysis and computing. R has many bioinformatic libraries (bundles of code made for performing specific tasks) for statistical analysis, as well as for data visualization. What really adds to the power of R is that it is a completely free and open-source project, and there is an enormous community of people all over the world that create these libraries of code for all of us to use.

---

<a class="anchor" id="oneR"></a>

# 1. Introduction to programming in R

Commands are more commonly called functions in R. The syntax of how these generally work in R looks like this: 

`function(arguments)`

Where the function name comes first and is followed by parentheses which would contain any arguments if they were needed. 

Even for functions that don't require any arguments, the function name still typically needs to be followed by parentheses so R knows to execute it. So running a function that doesn't need any arguments would look like this:

`function()`
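As a small illustration of this syntax, here are three functions that come with base R. The particular functions and values were picked just for demonstration; any function follows the same pattern.

```r
# A function that needs no arguments still needs its parentheses
Sys.Date()

# A function called with one argument
sqrt(16)

# A function called with two arguments, the second one given by name
round(3.14159, digits = 2)
```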

<a class="anchor" id="lib"></a>

## 1a. Installing and loading packages/libraries in R

An R library, or package, is a collection of functions, compiled code, and sometimes data. Some R packages are installed automatically with the base R installation, but there are tens of thousands that are not. The most prominent repositories for R packages are the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/web/packages/) and [Bioconductor](https://www.bioconductor.org/). 

If we want to use a package, we first may need to install it if we don't have it already. The typical way to install a package from the CRAN repository is with the `install.packages()` function. For example, the following is the syntax to install the BiocManager package, which is then itself used to install any package from the [Bioconductor](https://www.bioconductor.org/) suite.

`install.packages("BiocManager")`

<div class="alert alert-block alert-info">
    <b>Note</b>
    <br>
    You don't need to run any of the commands from this 1a section here, they are just here as examples.
</div>

[Bioconductor](https://www.bioconductor.org/) packages including [DESeq2](https://www.bioconductor.org/packages/release/bioc/html/DESeq2.html), a leading program for differential gene-expression analysis, are installed using the BiocManager `install()` function (instead of the `install.packages()` function). Here is how we would install DESeq2:

`BiocManager::install("DESeq2")`


After a package is installed, we can access a function in it by prefacing the function name (`install()` in the above example) with the package name and 2 colons (`BiocManager::` in the above example). Sometimes we may want to do that in order to be explicit, but when we will regularly be using many of the functions within a package, it is often more convenient to load the package. We do this with the `library()` function as shown in the example below:

`library(DESeq2)`

And then we can just type out the function names without the package name in front of them.
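To try both styles without installing anything, the sketch below uses the `tools` package, which ships with R. It was chosen here only because it is always available; the same pattern applies to packages like DESeq2 once they are installed.

```r
# Call a function with the explicit package::function form
tools::file_ext("example.csv")

# Load the package, after which the package name can be left off
library(tools)
file_ext("example.csv")
```

Both calls return the file extension, `"csv"`.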

---

<a class="anchor" id="variable"></a>

## 1b. Defining locations and assigning variables in R

#### getwd(), list.files(), and setwd()

The `getwd()` function in R stands for "get working directory". This function lists the current working directory, and is similar to the `pwd` command in Unix. 

The `list.files()` function is similar to `ls` at the command line. It will list the contents of the directory we provide, or the current working directory by default.

The `setwd()` function in R stands for "set working directory". This function allows us to change the current working directory, and is similar to the `cd` command in Unix.

Let's run `getwd()` in the next cell to print the current working directory. This particular function doesn't require any arguments.

In [5]:
getwd()

Notice that the output of `getwd()` has quotation marks around it. That is because R is treating the path as a **string**, which is a one-dimensional array of characters that only represents the textual information contained between the quotation marks.

For example, 'Sally Ride was the first female American astronaut to go into space on June 18, 1983' is a string. 

In the next cell, try running `setwd()` to change the current working directory to the `intro/SolarSystem` directory you made in the Unix Intro. Then, run `getwd()` to check if you were successful. 

<div class="alert alert-block alert-info">
    <b>Note</b>
    <br>
    Remember that R expects paths as <b>strings</b>, so be sure to pass your path to <code>setwd()</code> as a string.
</div>
 

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

This will set the current working directory to the right place using a *relative path* (which will work if we're in our home directory currently): 

`setwd("intro/SolarSystem")`


This will set the current working directory using an *absolute path* (which will work no matter where we are, so long as the directory exists):
    
`setwd("~/intro/SolarSystem")`


And this prints our current working directory:

`getwd()`

</details>
</div>

#### Assign variables in R

In most programming languages, including R, we use **variables** to store information. Variables are named objects which refer to data stored in memory. In R, we use the `<-` operator to assign information to a variable. Remember that it is very important to choose informative, memorable, and short variable names.

For example, in the next cell, we are going to use the variable `x` to refer to the result of the equation `2+4`.

In [None]:
x <- 2+4

We can access the data stored in the variable `x` by executing the variable name by itself:

In [None]:
x

We can use variables to hold paths to locations so that we don't have to type out the path every time. Run the cell below to assign the path to the SolarSystem directory to the `workDir` variable. We chose the `workDir` variable as a short representation of the phrase "work directory".

In [9]:
workDir <- '~/intro/SolarSystem'

Use the next cell to print the content of the `workDir` variable. Did you assign your variable successfully?

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

Running just the variable name will print out its contents:

`workDir`

</details>
</div>
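Putting these pieces together, here is a minimal sketch of moving around with paths stored in variables. It assumes nothing about your own directories: it uses `tempdir()`, a scratch directory that R creates for each session, and the names `startDir` and `scratch-file.txt` are made up for this example.

```r
# Remember where we started so we can return at the end
startDir <- getwd()

# Move into R's per-session scratch directory
setwd(tempdir())
getwd()

# Create an empty file there, then list the directory contents
file.create("scratch-file.txt")
list.files()

# Go back to where we started
setwd(startDir)
getwd()
```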

We can also use variables to store large amounts of data, such as assigning data within a matrix to a variable as you'll see in the next section.

<div class="alert alert-block alert-warning">
<center><b>WARNING</b></center>
Never name your variable after a common function or a built-in variable in R. For a list of built-in R functions and variables, see "Appendix D Function and variable index" of the <a href="https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf">R manual (pdf)</a>.
</div>
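To see why this matters, the short sketch below temporarily reuses the name of R's built-in variable `pi`. This is only a demonstration of the problem, not something to do in real code.

```r
# 'pi' is a built-in variable in R
pi

# Assigning to the same name hides the built-in value in our workspace
pi <- 3
pi * 2

# Removing our variable makes the built-in one visible again
rm(pi)
pi
```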



---

<a class="anchor" id="file"></a>

## 1c. Loading and viewing a data file in R

To begin working with a data file in R, we first have to load (aka read in) the file. How you load the data file in R will depend on the file type. R has several built-in functions for reading in common file types, including .csv (comma-separated values) files and .tsv (tab-separated-values) files.

In this tutorial, we are going to read in a .csv file.

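To do this, we will use the R function `read.table()` and its close relative `read.csv()`, which is the same function with defaults suited to comma-separated files. Both create a data frame, which is a very common data structure in R. **Note:** A data frame is similar to a matrix, except that each column can hold a different type of data.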

<a class="anchor" id="help"></a>
#### help()

Before reading in the .csv file, we'll use the `help()` function in R to look at the arguments available for the `read.table()` function. (Note: this is similar to the `--help` option that's used in most Unix commands). 

<div class="alert alert-block alert-info">
<b>Note</b>
<br>
    
There is a lot of information in the output from the following <code>help()</code> menu. Often reading data into R is something we will try, then check what was read in, then try again as we figure out which arguments are needed for our particular data to be read in properly. Scan through the help menu, and pay particular note to `header`, `sep`, and `row.names`, as those are relevant to the example data we are going to use, but don't worry about reading *everything* in there.

After looking at the help information, you can comment out this command by adding a <code>#</code> at the beginning of the line, and then rerun the cell to hide the output. Or you can select the output block, and click the vertical blue bar to its left to compress it.

</div>


In [45]:
help(read.table)

**read.table {utils}** (R Documentation)

* `file`: the name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, getwd(). Tilde-expansion is performed where supported. This can be a compressed file (see file). Alternatively, file can be a readable text-mode connection (which will be opened for reading if necessary, and if so closed (and hence destroyed) at the end of the function call). (If stdin() is used, the prompts for lines may be somewhat confusing. Terminate input with a blank line or an EOF signal, Ctrl-D on Unix and Ctrl-Z on Windows. Any pushback on stdin() will be cleared before return.) file can also be a complete URL. (For the supported URL schemes, see the ‘URLs’ section of the help for url.)
* `header`: a logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to TRUE if and only if the first row contains one fewer field than the number of columns.
* `sep`: the field separator character. Values on each line of the file are separated by this character. If sep = "" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.
* `quote`: the set of quoting characters. To disable quoting altogether, use quote = "". See scan for the behaviour on quotes embedded in quotes. Quoting is only considered for columns read as character, which is all of them unless colClasses is specified.
* `dec`: the character used in the file for decimal points.
* `numerals`: string indicating how to convert numbers whose conversion to double precision would lose accuracy, see type.convert. Can be abbreviated. (Applies also to complex-number inputs.)
* `row.names`: a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names. If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names. Otherwise if row.names is missing, the rows are numbered. Using row.names = NULL forces row numbering. Missing or NULL row.names generate row names that are considered to be ‘automatic’ (and not preserved by as.matrix).
* `col.names`: a vector of optional names for the variables. The default is to use "V" followed by the column number.
* `as.is`: controls conversion of character variables (insofar as they are not converted to logical, numeric or complex) to factors, if not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors. Note: to suppress all conversions including those of numeric columns, set colClasses = "character". Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.
* `na.strings`: a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.


<a class="anchor" id="read"></a>
#### read.table() and read.csv()

Notice that to run `read.table()`, we'll have to provide the function with a set of arguments. These arguments give R some information about the data in our .csv file. Arguments within a function call are separated by a `,`. The default arguments for `read.table()` are specified right near the top of the help menu, under "Usage", and include things like `header = FALSE` and `sep = ""`. The same help page also covers `read.csv()`, which is `read.table()` with defaults suited to .csv files, such as `header = TRUE` and `sep = ","`.

In this tutorial, we are going to provide `read.csv()` with the following arguments: 

* `'example.csv'` – the name of the file we want to read in
* `header = TRUE` – this argument specifies that the data starts on the second line of the file and the first line is a header, or column names
* `row.names = 1` – this argument specifies that the data starts on the second column of the file and the first column contains row names

In the next cell, we will first try reading in the `example.csv` file with `read.table()`, providing only the `sep` and `row.names` arguments, and store the data frame in the variable `myDF`. We chose the variable name `myDF` as a short representation of the phrase "my data frame".

In [52]:
myDF <- read.table("example.csv", sep = ",", row.names = 1)

In [53]:
myDF

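| | V2 | V3 | V4 |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` |
| | column1 | column2 | column3 |
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |
| row3 | 7 | 8 | 9 |
| row4 | 10 | 11 | 12 |
| row5 | 13 | 14 | 15 |
| row6 | 16 | 17 | 18 |
| row7 | 19 | 20 | 21 |
| row8 | 22 | 23 | 24 |
| row9 | 25 | 26 | 27 |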


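Notice that the column names were read in as a row of data, and that every column is of type `<chr>` (character) instead of numeric. That is because we did not tell `read.table()` that the first line of the file is a header. `read.csv()` assumes a comma separator and a header line by default, so let's read the file in again with it:

In [46]:
myDF <- read.csv('example.csv', header = TRUE, row.names = 1)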

Since the `example.csv` file is in our current working directory, we could just provide the file name as the relative path for the computer to find it. However, if the file were located in a directory other than the current working directory, we would have to define the path to the directory that holds your file. 

If the directory location of the file is held in a variable (as with the workDir we set in [section 1b](#variable)), you can use the function `file.path()` to construct the path to the file from the directory variable and the filename. 

For example, since you stored the path to your work directory in the `workDir` variable, you could have also read in the `example.csv` file using the following syntax: 

`myDF <- read.csv(file.path(workDir, 'example.csv'), header = TRUE, row.names = 1)`

It's not uncommon to "nest" functions like this in R, where here the `file.path()` function is being called inside the `read.csv()` function.
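The following self-contained sketch shows the same nesting on a tiny file that it writes itself, so it does not depend on `example.csv`. The names `demoDir`, `demo.csv`, `noHeader`, and `withHeader` are made up for this illustration. It also contrasts `read.table()` without a header against `read.csv()`.

```r
# Write a tiny CSV file into R's per-session scratch directory
demoDir <- tempdir()
writeLines(c(",column1,column2",
             "row1,1,2",
             "row2,3,4"),
           file.path(demoDir, "demo.csv"))

# Without header = TRUE, read.table() treats the column names as data
noHeader <- read.table(file.path(demoDir, "demo.csv"), sep = ",", row.names = 1)
dim(noHeader)

# read.csv() uses sep = "," and header = TRUE by default
withHeader <- read.csv(file.path(demoDir, "demo.csv"), row.names = 1)
dim(withHeader)
withHeader
```

The first data frame has one more row than the second, because the header line was counted as data.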

Use the next cell to view the data stored in the `myDF` variable.

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>
    
`myDF`

</details>
</div>



How many columns are in the `myDF` data frame?

How many rows?

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>
<br>

There are 3 columns and 10 rows.

</details>
</div>

<a class="anchor" id="headR"></a>
#### head()

Although in this example there is a manageable amount of data in `myDF`, in many cases, viewing all the data may be unfeasible. Thus, similar to the `head` command in Unix, R also has a built-in function to view only a certain number of rows of a data frame or matrix, called `head()`. 

In the next cell, run `head()` and provide only one argument, the data frame variable.

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`head(myDF)`

</details>
</div>

How many lines are given by default with the `head()` command in R? Is this different from the number of lines given by default with the `head` command in Unix? 

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`head()` in R prints out 6 lines by default
    
`head` in Unix prints out 10 by default

</details>
</div>


We can also specify the number of lines we want to view, by providing `head()` with another argument, `n=3`:

In [22]:
head(myDF, n = 3)

| | column1 | column2 | column3 |
|---|---|---|---|
| | `<int>` | `<int>` | `<int>` |
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |
| row3 | 7 | 8 | 9 |


Use the next cell to print the first 8 rows of the `myDF` data frame.

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`head(myDF, n = 8)`
    
</details>
</div>

<a class="anchor" id="dim"></a>
#### dim()

R also has a function called `dim()` that will print the dimensions of a data frame without having to print any of its rows. 

In the next cell, you will use the `dim()` function to report the *dimensions*, or number of rows and columns, of the `myDF` data frame. 

In [23]:
dim(myDF)

Is the row or column dimension reported first?

How many rows does `myDF` have? How many columns? 

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>
Number of rows is reported first.
    
10 rows and 3 columns
</details>
</div>
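For another quick look at these functions, the sketch below uses `mtcars`, one of the example datasets that ships with R (it is not part of this bootcamp's data). It also shows `nrow()` and `ncol()`, which report each dimension on its own.

```r
# mtcars is a built-in example data frame
dim(mtcars)     # rows first, then columns
nrow(mtcars)    # just the number of rows
ncol(mtcars)    # just the number of columns

# head() returns a smaller data frame, so dim() works on its output too
dim(head(mtcars, n = 3))
```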

<a class="anchor" id="summary"></a>
#### summary()

To get more information about a loaded data frame without having to print the entire thing, we can use the `summary()` function in R to print a mathematical summary. 

<div class="alert alert-block alert-info">
<b>Note</b>
<br>

`summary()` can also be used on non-data frame objects to report all the data components contained in the object.
</div>

Run the `summary()` function in the next cell to view a mathematical summary of your `myDF` data frame.

In [24]:
summary(myDF)

    column1         column2         column3     
 Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 4.75   1st Qu.: 5.75   1st Qu.: 6.75  
 Median :11.50   Median :12.50   Median :13.50  
 Mean   :11.70   Mean   :12.60   Mean   :13.50  
 3rd Qu.:18.25   3rd Qu.:19.25   3rd Qu.:20.25  
 Max.   :25.00   Max.   :26.00   Max.   :27.00  

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>
`summary(myDF)`

</details>
</div>


Which column contains the highest median value in the data frame, `myDF`? 

Which column contains the highest overall value?

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

Column 3 contains the highest median value and the highest overall value.

</details>
</div>


---

<a class="anchor" id="df"></a>

## 1d. Data frame manipulations

Much of the data we work with in bioinformatics is in the data frame or matrix format. For example, gene expression data is usually held in matrix format, with samples as columns and genes as rows. Each entry or cell in the matrix contains the expression of a particular gene in a particular sample. 

When analyzing numerical data in table format, it can be useful to be able to perform mathematical functions on all cells in a data frame, such as adding a value to all cells or taking the log of all cells. Fortunately, R makes that easy for us to do. 

Below are some examples of common mathematical manipulations we often perform on data frames in bioinformatics.  

<a class="anchor" id="dfadd"></a>
#### Add a value to all cells

In R, you can add, subtract, multiply, or divide the number in every cell of a data frame by a specific value very easily. Run the command in the next cell to add `1` to every value in your `myDF` data frame.

In [25]:
myDF + 1

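| | column1 | column2 | column3 |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| row1 | 2 | 3 | 4 |
| row2 | 5 | 6 | 7 |
| row3 | 8 | 9 | 10 |
| row4 | 11 | 12 | 13 |
| row5 | 14 | 15 | 16 |
| row6 | 17 | 18 | 19 |
| row7 | 20 | 21 | 22 |
| row8 | 23 | 24 | 25 |
| row9 | 26 | 27 | 28 |
| row10 | 1 | 1 | 1 |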


Use the next cell to subtract 2 from all values in your `myDF` data frame.

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`myDF - 2`

</details>
</div>

<a class="anchor" id="dflog"></a>
#### Take the log of all cells

R also has a `log()` function that will allow you to take the log of all values in a data frame. By default, the `log()` function will calculate the natural log. However, you can specify which base you want to use by providing an optional argument, such as `base = 3`. There are also shortcut functions for common bases:

`log10()` will compute the common logarithm (base 10)

`log2()` will compute the binary logarithm (base 2)

Run the next cell to compute the natural logarithm of all values in your `myDF` data frame.

In [31]:
log(myDF)

| | column1 | column2 | column3 |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| row1 | 0.000000 | 0.693147 | 1.098612 |
| row2 | 1.386294 | 1.609438 | 1.791759 |
| row3 | 1.945910 | 2.079442 | 2.197225 |
| row4 | 2.302585 | 2.397895 | 2.484907 |
| row5 | 2.564949 | 2.639057 | 2.708050 |
| row6 | 2.772589 | 2.833213 | 2.890372 |
| row7 | 2.944439 | 2.995732 | 3.044522 |
| row8 | 3.091042 | 3.135494 | 3.178054 |
| row9 | 3.218876 | 3.258097 | 3.295837 |
| row10 | -Inf | -Inf | -Inf |


Use the next cell to compute the binary logarithm of all values in your `myDF` data frame.

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>
This could be done with:<br><br>

`log2(myDF)`

Or:<br>

`log(myDF, base = 2)`

</details>
</div>
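The different log functions can be compared on single numbers too. The values below are made up for illustration; the last two lines show why a small value is often added before taking logs, since the log of zero is negative infinity.

```r
log(8)             # natural log (base e)
log2(8)            # base 2
log(8, base = 2)   # same result as log2(8)
log10(1000)        # base 10

log(0)             # -Inf
log2(0 + 1)        # adding 1 first gives 0 instead
```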


<a class="anchor" id="dfint"></a>
#### Convert data frame to contain only integers

Some bioinformatics applications, such as DESeq2 which is commonly used for differential gene expression analysis, require that the input data contain only integers. There is a function in R called `ceiling()` that will round decimal values up to the nearest integer. 

Run the next cell to test the `ceiling()` function.

In [32]:
ceiling(1.2)

Before we test this function on a data frame, we first have to create a data frame that contains decimal values. Note that although we did things like using the `log()` function to print the natural logarithm of all values in the `myDF` data frame, those calculations were not saved in the `myDF` variable.

Use the next cell to print the data held in the `myDF` variable. Are the values indeed the same as what we started with?

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`myDF`

Yes
</details>
</div>


To make changes to your `myDF` variable, the calculations must be assigned to `myDF`. In the following cell, you will subtract 0.3 from all values in your `myDF` data frame and assign the new values back to the `myDF` variable (though keep in mind this will overwrite the initial data stored there).

In [35]:
myDF <- myDF - 0.3

In the next cell, print the data held in the `myDF` variable. Have the values in `myDF` changed?

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`myDF`
    
Yes
</details>
</div>


<div class="alert alert-block alert-info">
<b>Note</b>
<br>
If we wanted to keep the original data and also keep the result of a calculation, we could assign the result to a different variable instead of overwriting <code>myDF</code>, as follows:
<code>myDFsub &lt;- myDF - 0.3</code>.

</div>

    
Now that your `myDF` data frame contains decimal values, use the `ceiling()` function to round all the values in `myDF` up to the nearest integer, and assign the new values back to the `myDF` variable.

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`myDF <- ceiling(myDF)`

</details>
</div>


In the next cell, print the data held in the `myDF` variable. How have the values in `myDF` changed?

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>
    
`myDF`

They have all been rounded up to the nearest integer.

</details>
</div>
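For comparison, R has a few related rounding functions. The sketch below runs them on single made-up numbers so the difference is easy to see.

```r
ceiling(1.2)    # always rounds up: 2
floor(1.2)      # always rounds down: 1
round(1.2)      # rounds to the nearest whole number: 1

# Small negative decimals round up to zero
ceiling(-0.3)
```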


<a class="anchor" id="dfcol"></a>
#### Slice a data frame column

When analyzing bioinformatics data, you may need to extract only one column from a data frame. To subset a data frame based on column names, we use the bracket `[` operator. This type of operation is also referred to as "slicing" the data frame. 

Run the cell below to slice `column1` of your `myDF` data frame.

In [37]:
myDF['column1']

| | column1 |
|---|---|
| | `<dbl>` |
| row1 | 1 |
| row2 | 4 |
| row3 | 7 |
| row4 | 10 |
| row5 | 13 |
| row6 | 16 |
| row7 | 19 |
| row8 | 22 |
| row9 | 25 |
| row10 | 0 |


**Challenge:** In the next cell, try subsetting `myDF` to "column2", but only view the first 3 rows of the output using `head()`.

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>
    
`head(myDF['column2'], n = 3)`

</details>
</div>

<a class="anchor" id="dfrow"></a>
#### Slice a data frame row

To subset a data frame based on row names, we again use the bracket `[` operator, but we add a comma after indicating the row name, which lets R know that we are slicing along rows instead of columns.

Run the cell below to slice `row1` of your `myDF` data frame.

In [39]:
myDF['row1', ]

| | column1 | column2 | column3 |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| row1 | 1 | 2 | 3 |


Within those `[]` brackets following a 2-dimensional object like the data frame we are using here, when there is just one thing listed, like we initially did when specifying a column name, it pulls out those columns. If we provide a comma in there, it expects the value preceding the comma to specify the rows we want (like we did above with 'row1'), and the value after the comma to specify the columns we want. When we leave one of those blank, like we did with columns above, it assumes we want all of them.

There are multiple ways we can specify which rows or columns we want, including by name, but we can also pass TRUE/FALSE vectors, as we'll see next.
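Here is a compact sketch of those bracket forms on a small made-up data frame. The names `toyDF`, `colA`, `colB`, and `r1` to `r3` are invented for this example, and `data.frame()` and `c()` are used only to build the toy table.

```r
toyDF <- data.frame(colA = c(1, 2, 3), colB = c(4, 5, 6),
                    row.names = c("r1", "r2", "r3"))

toyDF["colA"]                  # one column, all rows
toyDF["r2", ]                  # one row, all columns
toyDF["r2", "colB"]            # a single cell: row r2, column colB
toyDF[c(TRUE, FALSE, TRUE), ]  # rows picked with a TRUE/FALSE vector
```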

<a class="anchor" id="dffilter"></a>
#### Filter data in a data frame 

When analyzing bioinformatics data, we often need to filter the data to reduce noise. A common filtering method is to remove rows that have all zero values. To do this, we will remove all rows whose values sum to zero using a function called `rowSums()`.

First let's calculate the sum of each row in your `myDF` data frame using the `rowSums()` function:

In [40]:
rowSums(myDF)

Next, we'll use the greater than mathematical operator, `>`, to identify which rows have sums greater than zero:

In [41]:
rowSums(myDF) > 0

Finally, we can apply this output to subset the `myDF` data frame by removing all rows whose values sum to zero, using the same row slicing method we used above:

In [42]:
myDF[ rowSums(myDF) > 0 , ]

| | column1 | column2 | column3 |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |
| row3 | 7 | 8 | 9 |
| row4 | 10 | 11 | 12 |
| row5 | 13 | 14 | 15 |
| row6 | 16 | 17 | 18 |
| row7 | 19 | 20 | 21 |
| row8 | 22 | 23 | 24 |
| row9 | 25 | 26 | 27 |


R is converting the expression `rowSums(myDF) > 0` into a TRUE/FALSE vector, and then only returning the rows where the vector holds a value of TRUE.

Use what you've just learned to keep only the rows in `myDF` whose sum is less than 20 (that is, remove all rows whose sum is 20 or more), in the next cell. 

In [43]:
myDF[ rowSums(myDF) < 20 , ]

| | column1 | column2 | column3 |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |
| row10 | 0 | 0 | 0 |


<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

`myDF[ rowSums(myDF) < 20 , ]`

</details>
</div>


Which rows remain after filtering by those whose row sums are less than 20?

<div class="alert alert-block alert-success">

<details>
<summary><b>Solution</b></summary>

<br>

Rows 1, 2, and 10.

</details>
</div>
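It can help to break the filtering into named steps. The sketch below does that on a small made-up table of counts; the names `toyDF`, `sums`, `keep`, and `filtered`, and the gene and sample labels, are all invented for this example.

```r
toyDF <- data.frame(s1 = c(5, 0, 12), s2 = c(3, 0, 9),
                    row.names = c("geneA", "geneB", "geneC"))

sums <- rowSums(toyDF)   # one sum per row
keep <- sums > 0         # TRUE where the row sum is above zero
keep

# Only the rows marked TRUE are returned
filtered <- toyDF[keep, ]
filtered
```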

<a class="anchor" id="dfmore"></a>
#### Add columns to a data frame

When generating a table containing results from a bioinformatic analysis, it may be useful to add a column. To add a column to a data frame, we use the `[` bracket operator to name the new column, and the `<-` operator to assign a set of values to it, one value per row.

To create that set of values in R, we use the `c()` function, which combines all of its arguments into a vector.

Run the next cell to add a 4th column to your `myDF` data frame, then print the revised data frame, `myDF`.

In [None]:
myDF['column4'] <- c(1,2,3,4,5,6,7,8,9,10)
myDF
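The same pattern on a small made-up data frame (the names `toyDF`, `colA`, `colB`, and `colC` are invented here) shows that a new column can be typed out with `c()` or calculated from a column that already exists.

```r
toyDF <- data.frame(colA = c(1, 2, 3))

# The new vector needs one value per row of the data frame
toyDF["colB"] <- c(10, 20, 30)

# A new column can also be computed from an existing one
toyDF["colC"] <- toyDF["colA"] * 2
toyDF
```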

Use the next cell to add a 5th column to your `myDF` data frame, then print the revised data frame.

<a class="anchor" id="dfcombine"></a>
#### Combine data frames

Sometimes in bioinformatics, we have two (or more) data frames that we want to combine into one data frame. To do this, we can use the R function `cbind()`. `cbind()` requires at least two arguments: the names of the two data frames that need to be combined.
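As a quick sketch of how `cbind()` behaves, here it is on two small made-up data frames, next to its sibling `rbind()` for contrast. The names `left`, `right`, `sideBySide`, and `stacked` are invented for this example.

```r
left  <- data.frame(x = c(1, 2), y = c(3, 4))
right <- data.frame(z = c(5, 6))

# cbind() needs the data frames to have the same number of rows
sideBySide <- cbind(left, right)
sideBySide
dim(sideBySide)

# rbind() needs the data frames to have the same columns
stacked <- rbind(left, left)
dim(stacked)
```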

Let's duplicate `myDF` and then use `cbind()` to merge the original and duplicated data frames.

Run the following cell to create a copy of `myDF` in a variable called `myDF2`, then view the contents of the `myDF2` data frame.

In [None]:
myDF2 <- myDF
myDF2

Now that we have two data frames, `myDF` and `myDF2`, let's use `cbind()` to combine them into one data frame:

In [None]:
cbind(myDF, myDF2)

Now, instead of merely printing the combined data frame, use the next cell to create a variable called `combinedDF` that holds the combined data frame. We suggest the name `combinedDF` as a short representation of the phrase "combined data frame".

What are the dimensions of the `combinedDF` data frame? Hint: use the `dim()` function in the cell below.

Did `cbind()` merge these data frames on the row dimension or the column dimension? Why do you think that is? Hint: you can take a look at the `cbind()` documentation using the `help()` function.

---

<a class="anchor" id="export"></a>

## 1e. Export data from R

<a class="anchor" id="write"></a>

#### write.csv()

Thus far, we have manipulated data frame variables in R, but the altered data is only stored in memory until you export it. How you export data in R will depend on the file type to which you want to export the data. Similar to loading data in R, R has several built-in functions for exporting common file types, including .TXT files, .CSV (comma-separated values) files and .TSV (tab-separated-values) files.

In this tutorial, we will export the data as a .CSV file. To do this, we will use the `write.csv()` function to write a data frame out to a file. The following arguments are needed to execute the `write.csv()` function: 

* the data frame we want to write out 
* the file name we want to write to

In the next cell, you will use `write.csv()` to export your `combinedDF` data frame to a file called `combinedDF.csv`.

In [None]:
write.csv(combinedDF, 'combinedDF.csv')

List all files in your current directory using the `list.files()` function:

In [None]:
list.files()

Do you see your `combinedDF.csv` file in your current directory?
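One way to check an export is to read the file straight back in. The sketch below does this round trip with a small made-up data frame written to R's scratch directory; `toyDF`, `outFile`, and `roundTrip` are names invented for this example.

```r
toyDF <- data.frame(colA = c(1, 2), colB = c(3, 4),
                    row.names = c("r1", "r2"))

# Build the output path, then write the file
outFile <- file.path(tempdir(), "toyDF.csv")
write.csv(toyDF, outFile)

file.exists(outFile)

# write.csv() stores the row names in the first column,
# so we tell read.csv() to use that column as row names
roundTrip <- read.csv(outFile, row.names = 1)
roundTrip
```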

Challenge: Use the next few cells to export your `myDF2` data frame as a .CSV file called `myDF2.csv`, but this time specify your home directory as the location to write the file to (hint: store the path to your home directory, `'~'`, in a variable such as `homeDir`, just as we did with `workDir`, and build the full path with `file.path()`). Then list all files in your home directory. Did you successfully export the `myDF2.csv` file?

---

<a class="anchor" id="viz"></a>

## 1f. Visualizations

<a class="anchor" id="plot"></a>

#### plot()

R has a basic built-in function for many common plot types called `plot()`. 

At its most basic, the function call is: `plot(x,y)`, where x and y are numeric vectors containing the (x,y) points for the plot.

Let's call `plot()` with the following parameters: 

* x = the values from myDF2, column 1
* y = the values from myDF2, column 4


We saw before that we can use the following syntax to extract just one column from a dataframe: 
`myDF2['column1']`.  But recall that this produces a subset dataframe, not a vector of numbers. 

If we need just the values from a column as a vector of numbers, we can use the following syntax: 
`myDF2$column1`
> Note: The `$` specifies the column title.
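The difference between the two forms can be checked with `class()`, as in this small sketch on a made-up data frame (`toyDF`, `colA`, and `colB` are invented names):

```r
toyDF <- data.frame(colA = c(1, 2, 3), colB = c(4, 5, 6))

class(toyDF["colA"])   # "data.frame"
class(toyDF$colA)      # "numeric"

# Functions that expect a vector of numbers work with the $ form
mean(toyDF$colA)
```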

Use this syntax to fill in the x and y vectors in the function call below: 


In [None]:
plot()

Let's see if we can make this plot more interesting. Let's look at the parameters for `plot()`:

In [None]:
help(plot)

In the next few cells, recreate this plot but pass different values to the following 2 parameters: `type` and `col`. 

**Challenge:** Can you create a plot with both points and lines, colored purple? 
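As a starting point, here is a minimal sketch of how `type` and `col` are passed to `plot()`, using two made-up vectors (`xVals` and `yVals`) and a different plot type and color than the challenge asks for.

```r
xVals <- c(1, 2, 3, 4, 5)
yVals <- c(2, 4, 3, 6, 5)

# type = "l" draws a line; col sets its color
plot(xVals, yVals, type = "l", col = "blue",
     xlab = "x values", ylab = "y values", main = "Toy line plot")
```

The "Arguments" part of `help(plot)` lists the other values that `type` accepts.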

<br>

---
---

[**Previous:** 2. Unix intro](02-unix-intro.ipynb)
<br>

<div style="text-align: right"><a href="04-setup-QC.ipynb"><b>Next:</b> 4. Setup and QC</a></div>