# Introduction to R Part 10: Reading and Writing Data

Reading data into your R environment is the first step in conducting data analysis. Data comes in many different forms, and although R is equipped to deal with most data formats, this lesson will focus on reading common formats like comma separated values (CSV) files and Microsoft Excel files.

### R Working Directory and File Paths

Before we can jump in and start loading data, we need to learn a little bit about R's working directory and file paths. When you run R, it starts in a default location in your computer's file system called the working directory. You can check your working directory with the getwd() function:

In [1]:
getwd()                                    #Get the current working directory

The working directory acts as your starting point for accessing other files on your computer. To load data into R from your hard disk, you either need to put the data file in your working directory, change your working directory to the folder containing the data or supply the data's file path to the data reading function.

You can change your working directory by supplying a new file path in quotes to the setwd() function:

In [2]:
setwd("C:/Users/Greg/Desktop")            #Set a new working directory

getwd()                                   #Check the working directory again

*Note: you can use forward slashes in file paths even on Windows, which normally uses backslashes. If you want to use backslashes for file paths on Windows, you should use double backslashes (\\).
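
As a small illustration of the note above, the sketch below writes the same path both ways and checks that they describe the same location. The path is just the example folder used earlier, and nothing here touches the file system.

```r
# The same Windows path written two ways
path_forward <- "C:/Users/Greg/Desktop"       # forward slashes need no escaping
path_back    <- "C:\\Users\\Greg\\Desktop"    # each "\\" typed in code is one backslash in the string

cat(path_back, "\n")                          # cat() shows the string as R stores it

# Replacing every backslash with a forward slash recovers the first form
converted <- gsub("\\", "/", path_back, fixed = TRUE)
identical(converted, path_forward)
```

The doubled backslash is needed because a single backslash starts an escape sequence inside an R string.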

Instead of worrying about slashes in filepaths, you can have R construct file paths for you using the file.path() function. It takes a comma separated sequence of character strings and then uses them to construct a file path string for you:

In [3]:
my_path <- file.path("C:","Users","Greg","Desktop","Kaggle")      #Construct the file path

print (my_path )                #Check the path

setwd(my_path)                  #Set the working directory to the path

getwd()                         #Check the working directory again

[1] "C:/Users/Greg/Desktop/Kaggle"


In RStudio you can also change the working directory under the "Session" dropdown menu. Under "Session", select "Set Working Directory", then "Choose Directory", navigate to the folder you want to set as your working directory and click "Select Folder."

You can list the files and folders in the current working directory using the list.files() function:

In [4]:
list.files()                  #A list of files and folders in my Kaggle directory
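
If a folder holds many files, it may help to list only the ones you care about. The following is a minimal sketch, assuming a scratch folder created inside R's temporary directory; the folder name and the file names in it are made up for illustration. It also shows one way to restore the original working directory afterwards.

```r
# Create a scratch folder holding a few empty example files (hypothetical names)
scratch <- file.path(tempdir(), "wd_demo")
dir.create(scratch, showWarnings = FALSE)
invisible(file.create(file.path(scratch,
                                c("draft2015.csv", "notes.txt", "scores.csv"))))

old_wd <- getwd()                             # remember the starting directory
setwd(scratch)                                # move into the scratch folder

csv_files <- list.files(pattern = "\\.csv$")  # keep only names ending in .csv
print(csv_files)

setwd(old_wd)                                 # go back to where we started
```

The pattern argument takes a regular expression, so "\\.csv$" matches a literal ".csv" at the end of a file name.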

### Read CSV and TSV Files

Data is commonly stored in simple text files consisting of values delimited by a special character. For instance, CSV files use commas as the delimiter and tab separated values (TSV) files use tabs as the delimiter.

You can use the read.csv() function to read CSV files into R:

In [5]:
draft <- read.csv(file ="draft2015.csv",         #path to the file from the working directory      
                  stringsAsFactors = FALSE)      #whether to encode characters as factors



print(head(draft,15))

                Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
1   Karl-Anthony Towns             1   1     1     1              1  1
2        Jahlil Okafor             2   2     2     2              2  2
3      Emmanuel Mudiay             7   6     6     6              7  6
4     D'Angelo Russell             3   3     4     4              3  3
5   Kristaps Porzingis             6   5     3     3              4  4
6        Mario Hezonja             4   7     8     7              6  7
7      Justise Winslow             5   4     5     5              5  5
8  Willie Cauley-Stein            13   9     7    11              9 11
9      Stanley Johnson             8   8    12     8              8 10
10        Myles Turner            12  10    13    12             11 12
11        Bobby Portis            17  15    17    20             17 15
12        Devin Booker            10  11     9    13             13  8
13      Frank Kaminsky             9  12    10     9             12  9
14    

Data loaded into R via read.csv() becomes a data frame.
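
To see what the stringsAsFactors argument changes, here is a small sketch that reads the same two rows twice. It passes the CSV content through the text argument instead of a file so that it runs anywhere; the two rows are copied from the draft table above.

```r
# Two rows of CSV content supplied as a string instead of a file
csv_text <- "Player,Draft_Express\nKarl-Anthony Towns,1\nJahlil Okafor,2"

as_chars   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chars)             # "data.frame"
class(as_chars$Player)      # "character"
class(as_factors$Player)    # "factor"

str(as_chars)               # compact summary of the columns and their types
```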

To load tab separated values, include the sep argument and set it to the tab character "\t":

In [6]:
draft2 <- read.csv(file = "draft2015.tsv",       #path to the TSV file
                   sep = "\t",                   #use tabs as the delimiting character
                   stringsAsFactors = FALSE)

print(head(draft2,15))

                Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
1   Karl-Anthony Towns             1   1     1     1              1  1
2        Jahlil Okafor             2   2     2     2              2  2
3      Emmanuel Mudiay             7   6     6     6              7  6
4     D'Angelo Russell             3   3     4     4              3  3
5   Kristaps Porzingis             6   5     3     3              4  4
6        Mario Hezonja             4   7     8     7              6  7
7      Justise Winslow             5   4     5     5              5  5
8  Willie Cauley-Stein            13   9     7    11              9 11
9      Stanley Johnson             8   8    12     8              8 10
10        Myles Turner            12  10    13    12             11 12
11        Bobby Portis            17  15    17    20             17 15
12        Devin Booker            10  11     9    13             13  8
13      Frank Kaminsky             9  12    10     9             12  9
14    

The read.csv() function is a wrapper around a more general data reading function called read.table(). read.csv() just sets a few arguments of read.table() to values suitable for reading CSV files (with sep overridden, as above, for TSV files). The read.table() function has numerous additional arguments that affect how data is read; there are too many to cover in detail here, but you can always get more information by checking the function documentation with ?read.table or help(read.table).
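
To make that relationship concrete, the sketch below reads one small file with read.csv() and again with read.table(), spelling out the arguments that read.csv() sets by default. The temporary file and its three lines are stand-ins for a real data file, with rows copied from the draft table above.

```r
# Write a tiny CSV to a temporary file (a stand-in for a real data file)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Player,Draft_Express",
             "Karl-Anthony Towns,1",
             "D'Angelo Russell,3"), tmp)

via_csv <- read.csv(tmp, stringsAsFactors = FALSE)

via_table <- read.table(tmp,
                        header = TRUE,        # first line holds column names
                        sep = ",",            # comma delimiter
                        quote = "\"",         # only double quotes; read.table's default also treats ' as a quote
                        dec = ".",            # decimal point character
                        fill = TRUE,          # pad short rows with blanks
                        comment.char = "",    # do not treat # as a comment
                        stringsAsFactors = FALSE)

identical(via_csv, via_table)
```

The quote argument matters for names like D'Angelo Russell: with read.table()'s default quoting, the apostrophe would be read as the start of a quoted string.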

### Read Excel Files

Microsoft Excel is a ubiquitous enterprise spreadsheet program that stores data in its own format with the extension .xls or .xlsx. 

One simple way to read Excel data into R is to open an Excel workbook using Excel, save the data in CSV format or as a tab-delimited text file and then use the read.csv() function to load the data into R. 

If you want to read data from a .xls or .xlsx file directly into R, you'll need to download a package. Packages are extensions to the base R software library that give you access to additional functions. You can install packages from CRAN by supplying the name of the package to the install.packages() function. To read Excel files, we need the "xlsx" package. When you attempt to install a package in RStudio you will be prompted to select a web mirror; choose one close to you.

In [7]:
install.packages("xlsx",  repos='http://cran.us.r-project.org')   #install the xlsx package from CRAN

package 'xlsx' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Greg\AppData\Local\Temp\RtmpOwcCd4\downloaded_packages


*Note: I had to supply a CRAN mirror manually since I'm using a program that makes it easy to export text and code to a web friendly format instead of RStudio.

*Note: when you install a package, it may have dependencies that have to be installed first.
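
If you rerun a script often, you may not want to reinstall a package every time. One possible pattern, sketched below, is to check whether the package is already available first. The helper name needs_install is made up for this example, and the install step is wrapped in interactive() only so that the sketch does not trigger a download when run as a script.

```r
# Hypothetical helper: TRUE when a package is not yet installed
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

needs_install("stats")      # FALSE, since stats ships with R

# Install xlsx only when it is missing (and only in an interactive session)
if (interactive() && needs_install("xlsx")) {
  install.packages("xlsx", repos = "http://cran.us.r-project.org")
}
```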

After installing a package, you can load it into your R environment with the library() function:

In [8]:
library(xlsx)                            #library() loads in a package and its dependencies

Loading required package: rJava
Loading required package: xlsxjars


With our new package in hand, we can use its read.xlsx() function to read Excel files directly:

In [9]:
draft3 <- read.xlsx("draft2015.xlsx", 1)  #Reads the first worksheet in the file draft2015.xlsx

print(head(draft3))

              Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
1 Karl-Anthony Towns             1   1     1     1              1  1
2      Jahlil Okafor             2   2     2     2              2  2
3    Emmanuel Mudiay             7   6     6     6              7  6
4   D'Angelo Russell             3   3     4     4              3  3
5 Kristaps Porzingis             6   5     3     3              4  4
6      Mario Hezonja             4   7     8     7              6  7


If you want to read a specific worksheet in an Excel workbook, supply the sheetName argument:

In [10]:
dummy_data <- read.xlsx("draft2015.xlsx", 
                        sheetName="dummy_data")    #Loads in the specified worksheet

print(dummy_data)

        This Is Dummy Data
1  sometimes  2     0 fast
2    missing  4     0 fast
3       <NA>  7     1 slow
4       data  5     1 slow
5         is  3     0 fast
6       <NA>  4     0 slow
7  sometimes  6     0 fast
8  sometimes  5    NA slow
9       <NA>  5    NA fast
10   missing  4     0 fast


### Reading Web Data

The internet gives you access to more data than you could ever hope to analyze. Data analysis often begins with getting data from the web and loading it into R. Websites that offer data for download usually let you download data as CSV, TSV or Excel files.

The easiest way to use web data in R is to download it to your hard drive as a CSV, TSV or Excel file and then use the functions we discussed earlier to load it into R. You can supply a URL to read.csv() or read.table() to read data directly from the web, but doing so can be problematic since web data isn't always formatted nicely. You may need to do a little manual data cleaning, such as deleting unnecessary titles and header rows in Excel or a text editor like Notepad++, to prepare data for use in R.
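
In some cases the cleanup described above can be done at read time instead of by hand. The sketch below assumes a downloaded file that starts with two title lines before the real header; the file is created in a temporary location, and the title lines are invented for illustration. The skip argument tells read.csv() how many lines to ignore before it starts reading.

```r
# Simulate a downloaded file with two title lines above the real header
raw_lines <- c("NBA Draft Rankings",                  # made-up title line
               "Downloaded from an example site",     # made-up title line
               "Player,Draft_Express",
               "Karl-Anthony Towns,1",
               "Jahlil Okafor,2")

messy_file <- tempfile(fileext = ".csv")
writeLines(raw_lines, messy_file)

# Skip the two title lines so the third line is read as the header
cleaned <- read.csv(messy_file, skip = 2, stringsAsFactors = FALSE)

print(cleaned)
```

This only helps when the extra lines sit at the top of the file; stray notes in the middle or at the end of the data still need other handling.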

Reading from the clipboard is another option for reading web data and other tabular data. To read in data from the clipboard, highlight the data you want and copy it as if you were going to paste it somewhere else. Next, use the read.csv() or read.table() function with the first argument set to "clipboard":

In [11]:
#First I went to http://www.basketball-reference.com/leagues/NBA_2015_totals.html and
#clicked the CSV button to format data as CSV and then copied some data to the clipboard

BB_reference_data <- read.csv("clipboard")       #Read data from the clipboard

print ( head(BB_reference_data, 10) )            #Check the data

   Rk         Player Pos Age  Tm  G GS   MP  FG FGA   FG. X3P X3PA X3P.1 X2P
1   1     Quincy Acy  PF  24 NYK 68 22 1287 152 331 0.459  18   60 0.300 134
2   2   Jordan Adams  SG  20 MEM 30  0  248  35  86 0.407  10   25 0.400  25
3   3   Steven Adams   C  21 OKC 70 67 1771 217 399 0.544   0    2 0.000 217
4   4    Jeff Adrien  PF  28 MIN 17  0  215  19  44 0.432   0    0    NA  19
5   5  Arron Afflalo  SG  29 TOT 78 72 2502 375 884 0.424 118  333 0.354 257
6   5  Arron Afflalo  SG  29 DEN 53 53 1750 281 657 0.428  82  243 0.337 199
7   5  Arron Afflalo  SG  29 POR 25 19  752  94 227 0.414  36   90 0.400  58
8   6  Alexis Ajinca   C  26 NOP 68  8  957 181 329 0.550   0    0    NA 181
9   7 Furkan Aldemir  PF  23 PHI 41  9  540  40  78 0.513   0    5 0.000  40
10  8   Cole Aldrich   C  26 NYK 61 16  976 144 301 0.478   0    0    NA 144
   X2PA X2P.1  eFG.  FT FTA   FT. ORB DRB TRB AST STL BLK TOV  PF  PTS
1   271 0.494 0.486  76  97 0.784  79 222 301  68  27  22  60 147  398
2    61 0.4

Data comes in all sorts of formats other than the friendly ones we've discussed thus far. R has functions and packages for working with other common data formats like SAS, SPSS and Stata files, JSON, XML, HTML and databases. It would take too long to cover every data source you might encounter in one lesson. If you encounter a data source you don't know how to work with, a little bit of Googling will usually reveal how to convert it into a more familiar format or which R package can deal with it directly.

### Writing Data To CSV

In the course of cleaning data, data analysis and predictive modeling, you'll generate new data. You can write data in an R data frame to CSV using the write.csv() function:

In [12]:
write.csv(BB_reference_data,          #name of variable assigned to the data
          "BB_data.csv",              #name of the file to create to store the data
          row.names = FALSE)          #whether to include row names in the file

It's a good idea to save data periodically, especially after long, computationally expensive operations so that you don't lose progress or results.
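
One way to confirm that a file was written as expected is to read it back and compare it with the original. The sketch below does this with a two-row data frame (values copied from the basketball table above) written to R's temporary directory; the object and file names are made up for the example. It also shows what happens when row.names is left at its default.

```r
# A small data frame to save (two rows copied from the table above)
scores <- data.frame(Player = c("Quincy Acy", "Jordan Adams"),
                     Age = c(24L, 20L),
                     stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "scores_demo.csv")
write.csv(scores, out_file, row.names = FALSE)

# Read the file back and compare it with the original
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
all.equal(scores, reloaded)

# With the default row.names = TRUE, the row names become an extra column
out_file2 <- file.path(tempdir(), "scores_demo_rownames.csv")
write.csv(scores, out_file2)
with_rownames <- read.csv(out_file2, stringsAsFactors = FALSE)
names(with_rownames)        # "X" "Player" "Age"
```

Setting row.names = FALSE avoids that extra "X" column when the file is loaded again later.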

Now that we know the basics of reading and writing data, we're ready to start exploring data.

### Next Time: Introduction to R Part 11: Initial Data Exploration