# 1\. Input

Here we explore how to define a data set in a Julia session. Only two commands are explored. The first is for simple assignment of data, and the second is for reading in a data file. There are many ways to read data into a Julia session, but we focus on just two to keep it simple.

## 1.1\. Assignment

The most straightforward way to store a list of numbers is to assign it to a name. The idea is that a list of numbers is stored under a given name, and the name is used to refer to the data. A list is written inside square brackets, and assignment is specified with the “=” symbol. Another term for such a list of numbers is a “vector.”

The numbers within the square brackets are separated by commas. As an example, we can create a new variable called “bubba,” which will contain the numbers 3, 5, 7, and 9:

In [2]:
bubba = [3, 5, 7, 9]

4-element Array{Int64,1}:
 3
 5
 7
 9

When you enter this command Julia prints the value that was just assigned; end the line with a semicolon if you do not want to see it. The command creates a list of numbers called “bubba.” To see which numbers are included in bubba, type “bubba” and press the enter key:

In [3]:
bubba

4-element Array{Int64,1}:
 3
 5
 7
 9

If you wish to work with one of the numbers you can get access to it using the variable and then square brackets indicating which number:

In [4]:
bubba[1]

3

In [5]:
bubba[2]

5

In [6]:
bubba[3]

7

In [7]:
bubba[4]

9

Notice that the first entry is referred to as the number 1 entry. There is no zero entry, and asking for bubba[0] produces an error. You can also store strings, which are written with double quotes (single quotes are reserved for a single character), and you can store real numbers.
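The following is a minimal sketch of how indexing, strings, and real numbers behave, assuming a recent version of Julia. The names first_entry, last_entry, zero_is_error, labels, letter, and masses are made up for this illustration.

```julia
# A vector of integers, built with square brackets.
bubba = [3, 5, 7, 9]

# Indexing starts at 1; `end` refers to the last position.
first_entry = bubba[1]
last_entry = bubba[end]

# Asking for entry 0 raises a BoundsError.
zero_is_error = try
    bubba[0]
    false
catch err
    isa(err, BoundsError)
end

# Strings use double quotes; single quotes hold one character.
labels = ["A", "B"]
letter = 'A'

# Real numbers are stored as floating point values.
masses = [10.0, 11.0, 10.5]
```

Wrapping the out-of-range access in try/catch keeps the sketch runnable from top to bottom while still showing that the request fails.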

You now have a list of numbers and are ready to explore. In the chapters that follow we will examine the basic operations in Julia that will allow you to do some of the analyses required in class.

## 1.2\. Reading a CSV file

Unfortunately, it is rare to have just a few data points that you do not mind typing in at the prompt. It is much more common to have a lot of data points with complicated relationships. Here we will examine how to read a data set from a file using the readtable function from the DataFrames package, but first we discuss the format of a data file.

We assume that the data file is in the format called “comma separated values” (csv). That is, each line contains a row of values which can be numbers or letters, and each value is separated by a comma. We also assume that the very first row contains a list of labels. The idea is that the labels in the top row are used to refer to the different columns of values.

First we read a very short, somewhat silly, data file. The data file is called [simple.csv](_static/simple.csv) and has three columns of data and six rows. The three columns are labeled “trial,” “mass,” and “velocity.” We can pretend that each row comes from an observation during one of two trials labeled “A” and “B.” A copy of the data file is shown below and is created in defiance of Werner Heisenberg:

<span id="index-2"></span>
<table class="docutils" id="id1" border="1"><caption><span class="caption-text">simple.csv</span></caption> <colgroup><col width="33%"> <col width="33%"> <col width="33%"></colgroup> 
<thead valign="bottom">
<tr class="row-odd">
<th class="head">trial</th>
<th class="head">mass</th>
<th class="head">velocity</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even">
<td>A</td>
<td>10</td>
<td>12</td>
</tr>
<tr class="row-odd">
<td>A</td>
<td>11</td>
<td>14</td>
</tr>
<tr class="row-even">
<td>B</td>
<td>5</td>
<td>8</td>
</tr>
<tr class="row-odd">
<td>B</td>
<td>6</td>
<td>10</td>
</tr>
<tr class="row-even">
<td>A</td>
<td>10.5</td>
<td>13</td>
</tr>
<tr class="row-odd">
<td>B</td>
<td>7</td>
<td>11</td>
</tr>
</tbody>
</table>
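To make the file format concrete, the sketch below parses the same six rows by hand using only basic string functions. It illustrates the layout of a csv file and is not a description of how the reading command used in this tutorial works internally. The names csv_text, rows, mass_index, and mass are made up for the example, and a recent version of Julia is assumed.

```julia
# The contents of simple.csv, written inline so the sketch is self-contained.
csv_text = """
trial,mass,velocity
A,10,12
A,11,14
B,5,8
B,6,10
A,10.5,13
B,7,11
"""

# Split the text into lines, dropping the empty piece after the last newline.
lines = filter(l -> !isempty(l), split(csv_text, '\n'))

# The first line holds the labels; the remaining lines hold the values.
header = split(lines[1], ',')
rows = [split(l, ',') for l in lines[2:end]]

# Pull out one column by looking up its position in the header.
mass_index = findfirst(h -> h == "mass", header)
mass = [parse(Float64, r[mass_index]) for r in rows]
```

Every value starts out as text, which is why the mass column has to be converted with parse before it can be used as a number.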

The command to read the data file is _readtable_, which is provided by the DataFrames package. We have to give the command at least one argument, but we will give three different arguments to indicate how the command can be used in different situations. The first argument is the name of the file. The second argument, header, indicates whether or not the first row is a set of labels. The third argument, separator, indicates that a comma separates the values on each line. The following command will read in the data and assign it to a variable called “heisenberg”:

In [8]:
using DataFrames
heisenberg = readtable("simple.csv"; header = true, separator = ',');

In [9]:
heisenberg

Row  trial  mass  velocity
1    A      10.0  12
2    A      11.0  14
3    B      5.0   8
4    B      6.0   10
5    A      10.5  13
6    B      7.0   11


In [10]:
describe(heisenberg)

trial
Length  6
Type    UTF8String
NAs     0
NA%     0.0%
Unique  2

mass
Min      5.0
1st Qu.  6.25
Median   8.5
Mean     8.25
3rd Qu.  10.375
Max      11.0
NAs      0
NA%      0.0%

velocity
Min      8.0
1st Qu.  10.25
Median   11.5
Mean     11.333333333333334
3rd Qu.  12.75
Max      14.0
NAs      0
NA%      0.0%
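As a check on the summary, the mean and median of the mass column can be reproduced by hand. This is a minimal sketch that retypes the six mass values from the table rather than using the data frame itself; the names mass_mean, sorted_mass, and mass_median are made up for the illustration.

```julia
# The six mass values from simple.csv, retyped by hand.
mass = [10.0, 11.0, 5.0, 6.0, 10.5, 7.0]

# Mean: the sum of the values divided by how many there are.
mass_mean = sum(mass) / length(mass)

# Median of an even number of values: average the two middle entries
# of the sorted list.
sorted_mass = sort(mass)
mid = div(length(sorted_mass), 2)
mass_median = (sorted_mass[mid] + sorted_mass[mid + 1]) / 2

println("mean = ", mass_mean, ", median = ", mass_median)
```

The two values agree with the Mean and Median lines reported for mass in the summary.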



(Note that if you are using a Microsoft system the file naming convention is different from what we use here. If you want to use a backslash in a file name it needs to be escaped, i.e. use two backslashes together, “`\\`.” You can also choose which folder to use by changing the working directory with the cd command.)


To get more information on the different options available you can use the help system by typing a question mark followed by the name of the command:


In [11]:
?readtable

search: readtable



No documentation found.

`DataFrames.readtable` is a `Function`.

```julia
# 3 methods for generic function "readtable":
readtable(io::IO) at /home/scottc/.julia/v0.5/DataFrames/src/dataframe/io.jl:825
readtable(io::IO, nbytes::Integer) at /home/scottc/.julia/v0.5/DataFrames/src/dataframe/io.jl:825
readtable(pathname::AbstractString) at /home/scottc/.julia/v0.5/DataFrames/src/dataframe/io.jl:894
```


If Julia is not finding the file you are trying to read then it may be looking in the wrong folder/directory. You can change the working directory with the _cd_ command. If you are not sure what files are in the current working directory you can use the _readdir()_ command to list the files and the _pwd()_ command to determine the current working directory:


In [12]:
readdir()

5-element Array{ByteString,1}:
 "fixedWidth.dat"    
 "Input.ipynb"       
 ".ipynb_checkpoints"
 "simple.csv"        
 "trees91.csv"       

In [13]:
pwd()

"/home/scottc/workspace/JuliaTutorial/Input"

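Before reading a file it can help to confirm that it is where you expect. The sketch below is one way to do that, assuming a recent version of Julia; the names here, files, and found are made up for the example, and the message depends on your own directory.

```julia
# Report the current working directory and list its contents.
here = pwd()
files = readdir(here)

# Check for the data file before trying to read it.
found = isfile(joinpath(here, "simple.csv"))

if found
    println("simple.csv found in ", here)
else
    println("simple.csv not found in ", here)
end
```

Building the path with joinpath avoids having to type the separator by hand, which sidesteps the backslash issue on Microsoft systems.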
The variable “heisenberg” contains the three columns of data. Each column is assigned a name based on the header (the first line in the file). You can now access each individual column by placing the column name, written as a symbol with a leading colon, in square brackets after the name of the variable:


In [14]:
heisenberg[:trial]

6-element DataArrays.DataArray{UTF8String,1}:
 "A"
 "A"
 "B"
 "B"
 "A"
 "B"

In [15]:
heisenberg[:mass]

6-element DataArrays.DataArray{Float64,1}:
 10.0
 11.0
  5.0
  6.0
 10.5
  7.0

In [16]:
heisenberg[:velocity]

6-element DataArrays.DataArray{Int64,1}:
 12
 14
  8
 10
 13
 11

If you are not sure what columns are contained in the variable you can use the names command:


In [17]:
names(heisenberg)

3-element Array{Symbol,1}:
 :trial   
 :mass    
 :velocity
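The idea of looking up a column by its symbol can be imitated with a plain dictionary. The sketch below is only a stand-in for a data frame, assuming a recent version of Julia; the names columns, column_names, is_a, and mass_a are made up for the illustration.

```julia
# A dictionary that maps each column label (a symbol) to a vector of values.
columns = Dict(
    :trial => ["A", "A", "B", "B", "A", "B"],
    :mass => [10.0, 11.0, 5.0, 6.0, 10.5, 7.0],
    :velocity => [12, 14, 8, 10, 13, 11]
)

# Look up one column by its symbol.
mass = columns[:mass]

# List the labels; a dictionary has no fixed order, so sort them.
column_names = sort(collect(keys(columns)))

# Keep only the masses recorded during trial A.
is_a = columns[:trial] .== "A"
mass_a = mass[is_a]
```

Unlike a data frame, a dictionary does not remember the order of the columns or check that they have the same length, so this is only meant to show the lookup syntax.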

We will look at another example which is used throughout this tutorial. We will look at the data found in a spreadsheet located at [http://cdiac.ornl.gov/ftp/ndp061a/trees91.wk1](http://cdiac.ornl.gov/ftp/ndp061a/trees91.wk1). A description of the data file is located at [http://cdiac.ornl.gov/ftp/ndp061a/ndp061a.txt](http://cdiac.ornl.gov/ftp/ndp061a/ndp061a.txt). The original data is given in an Excel spreadsheet. It has been converted into a csv file, [trees91.csv](_static/trees91.csv), by deleting the top set of rows and saving it as a “csv” file, which is one of the save options within Excel. (You should save the file on your computer.) It is a good idea to open this file in a spreadsheet and look at it. This will help you make sense of how the data is stored.

The data give estimates of the biomass of ponderosa pine from a study performed by Dale W. Johnson, J. Timothy Ball, and Roger F. Walker, who are associated with the Biological Sciences Center, Desert Research Institute, P.O. Box 60220, Reno, NV 89506 and the Environmental and Resource Sciences College of Agriculture, University of Nevada, Reno, NV 89512\. The data consist of 54 lines, and each line represents an observation. Each observation includes 28 different measurements and markers for a given tree. For example, the first entry in each row is a number, either 1, 2, 3, or 4, which signifies a different level of exposure to carbon dioxide. The sixth number in every row is an estimate of the biomass of the stems of a tree. Note that the very first line in the file is a list of labels used for the different columns of data.

The data can be read into a variable called “tree” using the readtable command:

In [18]:
tree = readtable("trees91.csv"; header = true, separator = ',');

This will create a new variable called “tree.” If you type in “tree” at the prompt and hit enter, all of the numbers stored in the variable will be printed out. Try this, and you should see that it is difficult to make any sense out of the numbers.

There are many different ways to keep track of data in Julia. When you use the _readtable_ command the result is a specific kind of variable called a “data frame.” All of the data are stored within the data frame as separate columns. If you want an overview of what a data frame holds you can use the describe command. This will print a summary of each column in the variable:

In [19]:
describe(tree)

C
Min      1.0
1st Qu.  2.0
Median   2.0
Mean     2.5185185185185186
3rd Qu.  3.0
Max      4.0
NAs      0
NA%      0.0%

N
Min      1.0
1st Qu.  1.0
Median   2.0
Mean     1.9259259259259258
3rd Qu.  3.0
Max      3.0
NAs      0
NA%      0.0%

CHBR
Length  54
Type    UTF8String
NAs     0
NA%     0.0%
Unique  30

REP
Min      1.0
1st Qu.  9.0
Median   14.0
Mean     13.046511627906977
3rd Qu.  20.0
Max      20.0
NAs      11
NA%      20.37%

LFBM
Min      0.13
1st Qu.  0.48
Median   0.72
Mean     0.7649074074074075
3rd Qu.  1.0075
Max      1.76
NAs      0
NA%      0.0%

STBM
Min      0.03
1st Qu.  0.19
Median   0.245
Mean     0.28833333333333333
3rd Qu.  0.38
Max      0.72
NAs      0
NA%      0.0%

RTBM
Min      0.12
1st Qu.  0.28250000000000003
Median   0.445
Mean     0.4662037037037037
3rd Qu.  0.55
Max      1.51
NAs      0
NA%      0.0%

LFNCC
Min      0.88
1st Qu.  1.3125
Median   1.55
Mean     1.5598148148148145
3rd Qu.  1.7875
Max      2.76
NAs      0
NA%      0.0%

STNCC
Min      0.3

The summary is organized by the names which refer to each column of the data. For example, the first column is called “C” and the second column is called “N.” The variable tree is a data frame. The rows are numbered consecutively from 1 to 54\. Each column has 54 entries in it, although some entries may be missing (the REP column, for example, has 11 NAs).


If you know that a variable is a data frame but are not sure what labels are used to refer to the different columns you can use the names command:


In [20]:
show(names(tree))

[:C,:N,:CHBR,:REP,:LFBM,:STBM,:RTBM,:LFNCC,:STNCC,:RTNCC,:LFBCC,:STBCC,:RTBCC,:LFCACC,:STCACC,:RTCACC,:LFKCC,:STKCC,:RTKCC,:LFMGCC,:STMGCC,:RTMGCC,:LFPCC,:STPCC,:RTPCC,:LFSCC,:STSCC,:RTSCC]

If you want to work with the data in one of the columns you give the name of the data frame followed by the label assigned to the column, written as a symbol inside square brackets. For example, the first column in tree can be called using “tree[:C]:”


In [21]:
@show tree[:C];

tree[:C] = [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4]
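Once a column is available as a vector it can be summarized directly. The sketch below counts how many observations fall under each carbon dioxide level. It rebuilds the column from the values printed above rather than reading the file, and the names c, levels, and counts are made up for the example.

```julia
# The C column as printed above: 8 ones, 23 twos, 10 threes, and 13 fours.
c = vcat(fill(1, 8), fill(2, 23), fill(3, 10), fill(4, 13))

# The distinct exposure levels, in increasing order.
levels = sort(unique(c))

# Count how many observations were recorded at each level.
counts = [count(x -> x == level, c) for level in levels]

for (level, n) in zip(levels, counts)
    println("level ", level, ": ", n, " observations")
end
```

The counts add up to the 54 observations in the file, and the average of the rebuilt column matches the mean reported for C in the summary.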


## 1.3\. Brief Note on Fixed Width Files


There are many ways to read data using Julia. We only give two examples, direct assignment and reading csv files. However, another way deserves a brief mention. It is common to come across data that is organized in flat files and delimited at preset locations on each line. This is often called a “fixed width file.”


Fixed width files are not explored in depth here, but a brief example is given. In R the command to deal with this kind of file is _read.fwf_. Julia has no command of that name, so asking for help on it produces an error:


In [22]:
help(read.fwf)

LoadError: LoadError: UndefVarError: help not defined
while loading In[22], in expression starting on line 1

In R, the _read.fwf_ command requires at least two options. The first is the name of the file and the second is a list of numbers that gives the width of each column in the data file. A negative number in the list indicates that the column should be skipped. The R command to read the data file [fixedWidth.dat](_static/fixedWidth.dat) is shown as a comment below, with the optional _col.names_ option specifying the names of the columns. In this data file there are three columns. The first column is 17 characters wide, the second column is 15 characters wide, and the last column is 7 characters wide. The Julia line that follows attempts to read the same file with _readtable_, using a space as the separator, and in the version of DataFrames used here the attempt fails:


In [23]:
# a = read.fwf('fixedWidth.dat',widths=c(-17,15,7),col.names=c('temp','offices'))
a = readtable("fixedWidth.dat"; header = false, separator = ' ', names = [:name, :temp, :offices])

LoadError: LoadError: UndefVarError: skip_white not defined
while loading In[23], in expression starting on line 2

In [24]:
a

LoadError: LoadError: UndefVarError: a not defined
while loading In[24], in expression starting on line 1
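One way to handle a fixed width file without a dedicated command is to slice each line at the known positions. The following is a minimal sketch under several assumptions: the sample lines are hypothetical and are not the contents of fixedWidth.dat, the text is plain ASCII so that character positions match string indices, and a recent version of Julia is used. The helper split_fixed and the other names are made up for the illustration.

```julia
# Column widths described in the text: 17, 15, and 7 characters.
widths = [17, 15, 7]

# Hypothetical sample lines, padded to the widths above.
sample = [
    rpad("Anytown", 17) * rpad("20.5", 15) * rpad("3", 7),
    rpad("Otherville", 17) * rpad("18.0", 15) * rpad("12", 7)
]

# Turn the widths into start and stop positions: 1:17, 18:32, 33:39.
stops = cumsum(widths)
starts = stops .- widths .+ 1

# Slice one line at the fixed positions and trim the padding.
function split_fixed(line)
    return [strip(line[starts[i]:stops[i]]) for i in 1:length(widths)]
end

records = [split_fixed(l) for l in sample]

# Convert the second and third fields to numbers.
temp = [parse(Float64, r[2]) for r in records]
offices = [parse(Int, r[3]) for r in records]
```

To read from a real file, the sample vector could be replaced with the result of readlines on the file name, provided every line is at least as long as the sum of the widths.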