# Working with tabular data in R

[Original source](https://cengel.github.io/R-intro/)

Learning Objectives

* Load external data from a `.csv` file into a `data.frame` in R with `read.csv()`.
* Find basic properties of a data frame, including its size, the class or type of its columns, and the names of its rows and columns, by using `str()`, `nrow()`, `ncol()`, `dim()`, `length()`, `colnames()`, `rownames()`.
* Use `head()` and `tail()` to inspect rows of a data frame.
* Generate _summary statistics_ for a data frame.
* Use _indexing_ to select rows and columns.
* Use _logical conditions_ to select rows and columns.
* Add columns and rows to a data frame.
* Manipulate categorical data with _factors_, `levels()` and `as.character()`.
* Change how character strings are handled in a data frame.
* Format dates in R and calculate time differences.
* Use `df$new_col <- new_col` to add a new column to a data frame.
* Use `cbind()` to add a new column to a data frame.
* Use `rbind()` to add a new row to a data frame.
* Use `na.omit()` to remove rows from a data frame with NA values.

## Loading tabular data

We will be working with a sample dataset from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/data/) repository. It contains information about traffic stops of Black and white drivers in the state of Mississippi from January 2013 to mid-July 2016.

We are going to use the R function `download.file()` to download the CSV file that contains the traffic stop data, and we will use `read.csv()` to load into memory the content of the CSV file as an object of class `data.frame`.

In [2]:
# To download the data into the ../../Datasets/ directory (which must already exist), run the following:
download.file("https://github.com/cengel/R-intro/raw/master/data/MS_trafficstops_bw.csv",
              "../../Datasets/MS_trafficstops_bw.csv")

In [3]:
# Load the data:
trafficstops <- read.csv('../../Datasets/MS_trafficstops_bw.csv')

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. To check that our data has been loaded, we could print the variable’s value by typing `trafficstops`, but that would print every row. Instead, `head()` shows only the first six rows:

In [4]:
head(trafficstops)

id,state,stop_date,county_name,county_fips,police_department,driver_gender,driver_birthdate,driver_race,violation_raw,officer_id
MS-2013-00001,MS,2013-01-01,Jones County,28067,Mississippi Highway Patrol,M,1950-06-14,Black,Seat belt not used properly as required,J042
MS-2013-00002,MS,2013-01-01,Lauderdale County,28075,Mississippi Highway Patrol,M,1967-04-06,Black,Careless driving,B026
MS-2013-00003,MS,2013-01-01,Pike County,28113,Mississippi Highway Patrol,M,1974-04-15,Black,Speeding - Regulated or posted speed limit and actual speed,M009
MS-2013-00004,MS,2013-01-01,Hancock County,28045,Mississippi Highway Patrol,M,1981-03-23,White,Speeding - Regulated or posted speed limit and actual speed,K035
MS-2013-00005,MS,2013-01-01,Holmes County,28051,Mississippi Highway Patrol,M,1992-08-03,White,Speeding - Regulated or posted speed limit and actual speed,D028
MS-2013-00006,MS,2013-01-01,Jackson County,28059,Mississippi Highway Patrol,F,1960-05-02,White,Speeding - Regulated or posted speed limit and actual speed,K023
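
The download-and-load steps above depend on network access and on the `../../Datasets/` directory. As a minimal offline sketch of the same `read.csv()` step, the example below writes a tiny CSV to a temporary file and reads it back. The `toy` data and the `csv_path` name are hypothetical and are not part of the traffic stop dataset.

```r
# Hypothetical toy data, not taken from the traffic stop file.
toy <- data.frame(
  id = c("A-1", "A-2", "A-3"),
  county_fips = c(28067L, 28075L, 28113L)
)

# Write the toy data to a temporary CSV file, then read it back.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

toy_loaded <- read.csv(csv_path)

# The assignment above prints nothing, so inspect the result explicitly.
class(toy_loaded)        # "data.frame"
head(toy_loaded, n = 2)  # first two rows
```

Using `tempfile()` keeps the sketch from touching any real data directory.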


## Inspecting data.frame Objects

A data frame in R is a special case of a list: a representation of data in which the columns are vectors that all have the same length. Because each column is a vector, every value within a column has the same type (e.g., character, integer, factor), although different columns can have different types.
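
To make the list-of-vectors idea concrete, here is a small sketch with a hypothetical `people` data frame (not part of the traffic stop data). It checks that the object is a list, that each column has a single class, and that all columns have the same length.

```r
# Hypothetical example data: three columns of different types.
people <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age = c(31L, 45L, 27L),
  licensed = c(TRUE, FALSE, TRUE)
)

is.list(people)        # TRUE: a data frame is a list of columns
sapply(people, class)  # one class per column
lengths(people)        # every column has length 3
```

The class reported for `name` depends on the R version and on `stringsAsFactors`, so it is not shown here.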

In [5]:
# We can see this when inspecting the structure of a data frame with the function str():
str(trafficstops)

'data.frame':	211211 obs. of  11 variables:
 $ id               : Factor w/ 211211 levels "MS-2013-00001",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ state            : Factor w/ 1 level "MS": 1 1 1 1 1 1 1 1 1 1 ...
 $ stop_date        : Factor w/ 1288 levels "2013-01-01","2013-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ county_name      : Factor w/ 82 levels "Adams County",..: 34 38 57 23 26 30 30 22 26 26 ...
 $ county_fips      : int  28067 28075 28113 28045 28051 28059 28059 28043 28051 28051 ...
 $ police_department: Factor w/ 1 level "Mississippi Highway Patrol": 1 1 1 1 1 1 1 1 1 1 ...
 $ driver_gender    : Factor w/ 3 levels "","F","M": 3 3 3 3 3 2 2 2 3 3 ...
 $ driver_birthdate : Factor w/ 21423 levels "","1930-01-11",..: 3558 9575 12137 14670 18820 7061 4504 19135 2755 15878 ...
 $ driver_race      : Factor w/ 3 levels "","Black","White": 2 2 2 3 3 3 3 3 3 3 ...
 $ violation_raw    : Factor w/ 19 levels "??","Careless driving",..: 17 2 19 19 19 19 19 19 19 19 ...
 $ officer_id       : Factor
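
The `Factor` columns in the output above reflect text columns being converted to factors on import. That was the default of `read.csv()` before R 4.0.0; from R 4.0.0 onward the default is `stringsAsFactors = FALSE`, so the same call may report `chr` columns instead. The sketch below uses a hypothetical three-row file to show both settings explicitly.

```r
# Hypothetical three-row CSV written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("driver_race,county_fips", "Black,28067", "White,28075", "White,28113"),
  csv_path
)

# Read the same file with both settings, stated explicitly.
as_factor <- read.csv(csv_path, stringsAsFactors = TRUE)
as_text   <- read.csv(csv_path, stringsAsFactors = FALSE)

class(as_factor$driver_race)   # "factor"
class(as_text$driver_race)     # "character"
levels(as_factor$driver_race)  # "Black" "White"
```

Passing `stringsAsFactors` explicitly makes a script behave the same way regardless of the R version it runs on.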

The functions `head()` and `str()` can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. 

### Size

In [8]:
# returns a vector with the dimensions of the object (rows, then columns)
dim(trafficstops)

In [9]:
# returns the number of rows
nrow(trafficstops)

In [10]:
# returns the number of columns
ncol(trafficstops)

In [11]:
# also returns the number of columns, because a data frame is a list of columns
length(trafficstops)
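
As a quick illustration of how these four functions relate, the sketch below applies them to a hypothetical 4-row, 3-column data frame named `toy`.

```r
# Hypothetical data frame with 4 rows and 3 columns.
toy <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, TRUE, TRUE)
)

dim(toy)     # 4 3
nrow(toy)    # 4
ncol(toy)    # 3
length(toy)  # 3, the number of columns (list elements)

# dim() combines the other two: rows first, then columns.
identical(dim(toy), c(nrow(toy), ncol(toy)))  # TRUE
```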

### Content

In [15]:
# shows the first 3 rows (without n, head() shows the first 6)
head(trafficstops, n=3)

id,state,stop_date,county_name,county_fips,police_department,driver_gender,driver_birthdate,driver_race,violation_raw,officer_id
MS-2013-00001,MS,2013-01-01,Jones County,28067,Mississippi Highway Patrol,M,1950-06-14,Black,Seat belt not used properly as required,J042
MS-2013-00002,MS,2013-01-01,Lauderdale County,28075,Mississippi Highway Patrol,M,1967-04-06,Black,Careless driving,B026
MS-2013-00003,MS,2013-01-01,Pike County,28113,Mississippi Highway Patrol,M,1974-04-15,Black,Speeding - Regulated or posted speed limit and actual speed,M009


In [14]:
# shows the last 3 rows (without n, tail() shows the last 6)
tail(trafficstops, n = 3)

,id,state,stop_date,county_name,county_fips,police_department,driver_gender,driver_birthdate,driver_race,violation_raw,officer_id
211209,MS-2016-24295,MS,2016-07-11,Grenada County,28043,Mississippi Highway Patrol,M,1998-02-02,White,Seat belt not used properly as required,D014
211210,MS-2016-24296,MS,2016-07-14,Copiah County,28029,Mississippi Highway Patrol,F,1970-06-14,White,Expired or no non-commercial driver license or permit,C015
211211,MS-2016-24297,MS,2016-07-14,Copiah County,28029,Mississippi Highway Patrol,M,1948-03-11,White,Seat belt not used properly as required,C015
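
The effect of the `n` argument is easier to see on a small table. The following sketch uses a hypothetical 10-row data frame, `toy`, and is only meant to illustrate the default and an explicit `n`.

```r
# Hypothetical 10-row data frame.
toy <- data.frame(row_id = 1:10, value = letters[1:10])

nrow(head(toy))   # 6: the default number of rows
head(toy, n = 3)  # rows 1 to 3
tail(toy, n = 3)  # rows 8 to 10, shown with their original row names
```

Note that `tail()` keeps the original row names, which is why the last rows of `trafficstops` are labelled 211209 to 211211.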


### Names

In [16]:
# returns the column names (synonym of colnames() for data.frame objects)
names(trafficstops) 

In [18]:
# returns the row names (left commented out: it prints one name per row, 211211 here)
#rownames(trafficstops)
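
Because printing every row name of a large table is impractical, the sketch below shows the same functions on a hypothetical 3-row data frame, `toy`, and one way to peek at only the first few row names.

```r
# Hypothetical 3-row data frame with default row names.
toy <- data.frame(x = 1:3, y = c("a", "b", "c"))

names(toy)                            # "x" "y"
identical(names(toy), colnames(toy))  # TRUE for data frames
rownames(toy)                         # "1" "2" "3"

# For a large data frame, look at just the first few row names.
head(rownames(toy), n = 2)            # "1" "2"
```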

### Summary

In [19]:
# structure of the object and information about the class, length and content of each column
str(trafficstops)

'data.frame':	211211 obs. of  11 variables:
 $ id               : Factor w/ 211211 levels "MS-2013-00001",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ state            : Factor w/ 1 level "MS": 1 1 1 1 1 1 1 1 1 1 ...
 $ stop_date        : Factor w/ 1288 levels "2013-01-01","2013-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ county_name      : Factor w/ 82 levels "Adams County",..: 34 38 57 23 26 30 30 22 26 26 ...
 $ county_fips      : int  28067 28075 28113 28045 28051 28059 28059 28043 28051 28051 ...
 $ police_department: Factor w/ 1 level "Mississippi Highway Patrol": 1 1 1 1 1 1 1 1 1 1 ...
 $ driver_gender    : Factor w/ 3 levels "","F","M": 3 3 3 3 3 2 2 2 3 3 ...
 $ driver_birthdate : Factor w/ 21423 levels "","1930-01-11",..: 3558 9575 12137 14670 18820 7061 4504 19135 2755 15878 ...
 $ driver_race      : Factor w/ 3 levels "","Black","White": 2 2 2 3 3 3 3 3 3 3 ...
 $ violation_raw    : Factor w/ 19 levels "??","Careless driving",..: 17 2 19 19 19 19 19 19 19 19 ...
 $ officer_id       : Factor

In [20]:
# summary statistics for each column
summary(trafficstops)

             id         state            stop_date     
 MS-2013-00001:     1   MS:211211   2013-07-03:   491  
 MS-2013-00002:     1               2015-11-25:   488  
 MS-2013-00003:     1               2015-12-31:   483  
 MS-2013-00004:     1               2015-09-07:   436  
 MS-2013-00005:     1               2013-07-04:   430  
 MS-2013-00006:     1               2013-05-26:   420  
 (Other)      :211205               (Other)   :208463  
            county_name      county_fips                     police_department 
 Monroe County    : 10469   Min.   :28001   Mississippi Highway Patrol:211211  
 Lauderdale County:  8795   1st Qu.:28043                                      
 Jackson County   :  6759   Median :28075                                      
 Harrison County  :  6096   Mean   :28077                                      
 Copiah County    :  6086   3rd Qu.:28113                                      
 Hinds County     :  4993   Max.   :28163                               