## Quick R Tutorial

> Based on the tutorial by Kelly Black: https://www.cyclismo.org/tutorial/R/

In [1]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



### 1. Input

In [2]:
bubba <- c(1,2,3)

bubba[1]

#### 1.1 Read csv

First we read a very short, somewhat silly, data file. The data file is called simple.csv and has three columns of data and six rows. The three columns are labeled “trial,” “mass,” and “velocity.” We can pretend that each row comes from an observation during one of two trials labeled “A” and “B.”

The command to read the data file is read.csv. It needs at least one argument, but we will give three to show how the command can be used in different situations. The first argument is the name of the file. The second argument indicates whether or not the first row is a set of labels. The third argument indicates that the values on each line are separated by commas. The following command will read in the data and assign it to a variable called “heisenberg”:



In [3]:
heisenberg <- read.csv(file = '../input/r-tutorial-sample/simple.csv', header = TRUE, sep = ',')
heisenberg

trial,mass,velocity
<fct>,<dbl>,<int>
A,10.0,12
A,11.0,14
B,5.0,8
B,6.0,10
A,10.5,13
B,7.0,11
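The same call can be tried without the data file. The sketch below is only an illustration: it assumes the file's contents match the table printed above and passes them through the `text` argument of `read.csv` instead of `file`. The names `csv_text` and `heisenberg_demo` are made up for this example.

```r
# Hypothetical inline copy of simple.csv (contents assumed from the output above)
csv_text <- "trial,mass,velocity
A,10.0,12
A,11.0,14
B,5.0,8
B,6.0,10
A,10.5,13
B,7.0,11"

# header = TRUE: the first row holds the column names
# sep = ",":     fields on each line are separated by commas
heisenberg_demo <- read.csv(text = csv_text, header = TRUE, sep = ",")

dim(heisenberg_demo)    # number of rows, then number of columns
names(heisenberg_demo)  # column names taken from the header row
```

Spelling out `header` in full avoids relying on R's partial matching of argument names.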


In [4]:
summary(heisenberg)

 trial      mass          velocity    
 A:3   Min.   : 5.00   Min.   : 8.00  
 B:3   1st Qu.: 6.25   1st Qu.:10.25  
       Median : 8.50   Median :11.50  
       Mean   : 8.25   Mean   :11.33  
       3rd Qu.:10.38   3rd Qu.:12.75  
       Max.   :11.00   Max.   :14.00  

In [5]:
help(read.csv)

The variable “heisenberg” contains the three columns of data. Each column is assigned a name based on the header (the first line in the file). You can now access each individual column by writing the variable name and the column name separated by a “$”:



In [7]:
class(heisenberg$mass)

In [8]:
heisenberg$mass[2]

In [9]:
names(heisenberg)
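As a rough illustration of column access, the sketch below builds a small stand-in data frame (the name `demo` is hypothetical; the values are copied from the simple.csv output shown earlier) and picks out a column in a few equivalent ways.

```r
# Stand-in data frame; values copied from the simple.csv output
demo <- data.frame(
  trial    = c("A", "A", "B", "B", "A", "B"),
  mass     = c(10.0, 11.0, 5.0, 6.0, 10.5, 7.0),
  velocity = c(12L, 14L, 8L, 10L, 13L, 11L)
)

demo$mass            # the whole column as a numeric vector
demo[["mass"]]       # the same column, selected by a string
demo$mass[2]         # second element of the column
demo[2, "mass"]      # the same value via row and column indexing
mean(demo$velocity)  # a column can be used like any other vector
```

The `[[ ]]` form is handy when the column name is stored in a variable, since `$` expects the name to be typed literally.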

In [10]:
tree <- read.csv(file = '../input/r-tutorial-sample/trees91.csv', header = TRUE, sep = ',')

head(tree)

,C,N,CHBR,REP,LFBM,STBM,RTBM,LFNCC,STNCC,RTNCC,⋯,RTKCC,LFMGCC,STMGCC,RTMGCC,LFPCC,STPCC,RTPCC,LFSCC,STSCC,RTSCC
,<int>,<int>,<fct>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,1,CL6,1.0,0.43,0.13,0.29,1.84,0.4,0.96,⋯,0.62,0.11,0.13,0.06,0.23,0.31,0.17,0.13,0.17,0.14
2,1,1,CL7,1.0,0.4,0.15,0.25,1.82,0.37,0.95,⋯,0.49,0.1,0.18,0.06,0.22,0.22,0.13,0.22,0.28,0.13
3,1,2,A1,9.0,0.45,0.2,0.21,1.54,0.96,0.69,⋯,0.64,0.12,0.16,0.08,0.3,0.35,0.21,0.15,0.19,0.15
4,1,2,A1,14.0,0.82,0.26,0.29,1.75,0.97,0.83,⋯,0.64,0.12,0.16,0.08,0.3,0.35,0.21,0.15,0.19,0.15
5,1,2,A1,20.0,0.52,0.19,0.25,2.01,1.29,0.8,⋯,0.64,0.12,0.16,0.08,0.3,0.35,0.21,0.15,0.19,0.15
6,1,2,A7,,1.32,0.46,0.48,1.45,0.92,0.72,⋯,0.42,0.13,0.14,0.07,0.23,0.25,0.15,0.15,0.16,0.13


There are many different ways to keep track of data in R. When you use the read.csv command, R stores the result in a specific kind of variable called a “data frame.” All of the data are stored within the data frame as separate columns. If you are not sure what kind of variable you have, you can use the attributes command. This will list all of the things that R uses to describe the variable:



In [11]:
attributes(tree)
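To see what `attributes` returns without the full trees91.csv file, here is a minimal sketch on a three-row stand-in. The name `demo` is hypothetical, and the values are copied from the first rows of the `head(tree)` output above.

```r
# Three-row stand-in for the tree data (first three columns only)
demo <- data.frame(
  C    = c(1L, 1L, 1L),
  N    = c(1L, 1L, 2L),
  CHBR = c("CL6", "CL7", "A1")
)

attrs <- attributes(demo)  # a named list describing the object

names(attrs)     # which attributes are stored
attrs$names      # the column names
attrs$class      # the kind of object
attrs$row.names  # the row labels
```

Because the result is an ordinary list, each piece can be pulled out with `$` in the same way as a column of a data frame.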

The first thing that R stores is a list of names that refer to the columns of the data. For example, the first column is called “C” and the second column is called “N.” The variable tree has class data.frame. Finally, the rows are numbered consecutively from 1 to 54, so each column has 54 entries.



In [None]:
names(tree)
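The size and column types of a data frame can also be checked directly. The sketch below uses a small stand-in (the name `demo` is hypothetical; values are copied from the first rows of the `head(tree)` output) rather than the real file, so the counts it reports are for the stand-in, not for the 54-row tree data.

```r
# Three-row stand-in for the tree data (first three columns only)
demo <- data.frame(
  C    = c(1L, 1L, 1L),
  N    = c(1L, 1L, 2L),
  CHBR = c("CL6", "CL7", "A1")
)

nrow(demo)          # number of rows
ncol(demo)          # number of columns
dim(demo)           # both at once: rows, then columns
sapply(demo, class) # class of each column
str(demo)           # compact overview of the whole data frame

# Note: text columns such as CHBR are converted to factors by default
# in R versions before 4.0.0 and kept as character vectors from 4.0.0 on.
```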