<br> 
<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>


## Course: Data-Driven Management and Policy

### Prof. José Manuel Magallanes, PhD 

_____

# Session 3: Data Structures





<a id='beginning'></a>

Before starting, keep in mind the following ideas:

* Computers and humans need some structure in their language to communicate.
* Unlike with humans, we should not allow the computer to guess what we mean. Talking to the computer therefore has to follow a particular set of rules, so that our orders are unambiguous.
* Errors happen when we do not speak clearly to the computer; but it is worse if the computer does something we did not mean.
* Data structures are the ways the computer organizes pieces of data, so they can be stored, retrieved, used and modified.


We are going to talk about 3 data structures in R:

1. [Lists.](#part1) 
2. [Vectors.](#part2) 
3. [Data Frame.](#part3) 

**Lists** and **vectors** are simple structures; a **data frame** is a more complex one (built from the simple ones). 

----

<a id='part1'></a>

## List

Lists are containers of values. The values can be of any kind (numbers or non-numbers), and even other containers (simple or complex). 

If we use a **spreadsheet** as a reference, a row is a 'natural' list.

![](spreadSheet.png)

Then this can be a list:

In [1]:
DetailStudent=list("Fred Meyers",
                   40,
                   FALSE)

The *object* _DetailStudent_ serves to store _temporarily_ the list in the computer. To name a list, use combinations of letters and numbers in a meaningful way (do not start with a number or a special character).

Typing the name of the object _DetailStudent_, now representing a list, will give you all the contents you saved in there:


In [2]:
DetailStudent

The list above has three elements. However, you may be wondering what those elements mean together. In those situations, it is better to give a name to each element.


In [3]:
DetailStudent=list(fullName="Fred Meyers",
                   age=40,
                   female=FALSE)

In [4]:
# seeing the result
DetailStudent

This list has three elements, which we can also call _fields_. Each of these, in this case, holds a different data type:

* *fullName* holds characters
* *age* holds a number
* *female* holds a logical (Boolean) value.

You can access any of those elements using these approaches:

In [5]:
# position
DetailStudent[[1]]

In [6]:
# name of the field
DetailStudent[['fullName']]

In [7]:
# name of the field
DetailStudent$fullName

If you do not have _names_ for the list fields, you can only access them using positions:

In [8]:
NewList=list('a','b','c','d',1,2,3)
NewList[[1]]
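
A detail worth seeing in code is the difference between double and single brackets. The sketch below uses a hypothetical list, _lettersAndNumbers_ (not part of the session data): double brackets return the element itself, while single brackets return a smaller list.

```r
# a small unnamed list (hypothetical example)
lettersAndNumbers = list('a', 'b', 1, 2)

single = lettersAndNumbers[1]     # single brackets: a list of length 1
double = lettersAndNumbers[[1]]   # double brackets: the element itself

is.list(single)       # TRUE
is.character(double)  # TRUE

# single brackets accept several positions at once
firstTwo = lettersAndNumbers[c(1, 2)]
length(firstTwo)      # 2
```

Use double brackets when you want to work with the value, and single brackets when you want a sub-list.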

Once you access an element, you can alter it:

In [9]:
DetailStudent[[1]]='Alfred Mayer'
# Then:
DetailStudent

You can even add a totally NEW field like this:

In [10]:
DetailStudent$city='Seattle'

# show:
DetailStudent

And destroy it by **NULL**ing it, like this:

In [11]:
DetailStudent$city=NULL # alternatively: DetailStudent[[4]]=NULL
DetailStudent

You can get rid of a list using:

In [12]:
rm(DetailStudent)
DetailStudent

ERROR: Error in eval(expr, envir, enclos): object 'DetailStudent' not found


**How would you create a list for this person out of his personal information?**


<img src="listExample.png" alt="Drawing" style="width: 200px;"/>

In [13]:
cr7=list('FullName'='Cristiano Ronaldo dos Santos Aveiro', 
         'DateOfBirth'='5 February 1985',
         'PlaceOfBirth'='Funchal, Madeira, Portugal',
         'HeightInMeters'=1.89,
         'PlayingPosition'='Forward'
        )

In [14]:
#seeing the result:
cr7

There is nothing wrong with the previous list. But keep in mind that we save data to retrieve it and act (decide) upon its value. For example, can we answer the question:

* What is Ronaldo's playing position?

In [15]:
cr7$PlayingPosition

Great! However, we cannot directly answer:

* How old is he?



In [16]:
# what is today? 
today - cr7$DateOfBirth

ERROR: Error in eval(expr, envir, enclos): object 'today' not found


Right way:

In [17]:
Sys.Date()

In [18]:
# Then,
Sys.Date() - cr7$DateOfBirth

ERROR: Error in unclass(as.Date(e1)) - e2: non-numeric argument to binary operator


The problem is that _DateOfBirth_ is not a date; it is simply text.

In [19]:
cr7$DateOfBirth; str(cr7$DateOfBirth)

 chr "5 February 1985"


In [20]:
# updating
# some may need: Sys.setlocale("LC_TIME", "English")

cr7$DateOfBirth=as.Date(cr7$DateOfBirth,format="%d %B %Y");str(cr7$DateOfBirth)

 Date[1:1], format: "1985-02-05"


Using the right [format](https://www.statmethods.net/input/dates.html) will allow you to accomplish what you need:

In [21]:
# then

Sys.Date()-cr7$DateOfBirth

Time difference of 12487 days

Or, in a simpler way (with the help of lubridate package):

In [23]:
library(lubridate)

# how many years:
# notice I am using 2 functions: interval and time_length

time_length(interval(cr7$DateOfBirth,Sys.Date()),"years")
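
If lubridate is not available, a rough base R alternative is possible. This is only a sketch: the names _birth_ and _reference_ are mine, the reference date is fixed (an assumption, so the result is reproducible), and dividing by 365.25 is an approximation that can be off by a day around birthdays.

```r
# dates written in ISO format do not depend on the locale
birth = as.Date("1985-02-05")
reference = as.Date("2019-04-15")  # hypothetical fixed date instead of Sys.Date()

# subtracting two dates gives a difference in days
days = as.numeric(reference - birth)
days                               # 12487

# approximate age in years (365.25 accounts for leap years on average)
yearsApprox = days / 365.25
floor(yearsApprox)                 # 34
```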

[Go to page beginning](#beginning)

----

<a id='part2'></a> 

## Vectors
Vectors are also containers of values. The values should be of only __one__ type (otherwise, __R__ may alter or _coerce_ them silently). If we use a spreadsheet as a reference, a column can be a natural vector.

![](spreadSheet_col.png)
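
To make the idea of silent coercion concrete, here is a minimal sketch with two hypothetical vectors (_mixed1_ and _mixed2_ are illustrative names). R gives no warning in either case:

```r
# logical values mixed with numbers become numbers
mixed1 = c(1, TRUE, FALSE)
class(mixed1)   # "numeric"
mixed1          # 1 1 0

# anything mixed with text becomes text
mixed2 = c(1, TRUE, "a")
class(mixed2)   # "character"
mixed2          # "1" "TRUE" "a"
```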

Here, we will create three vectors using the "**c(...)**" function: 

In [24]:
fullnames=c("Fred Meyers","Sarah Jones", "Lou Ferrigno","Sky Turner")
ages=c(40,35, 60,77)
female=c(F,T,T,T)

Each *object* is holding temporarily a vector. Use combinations of letters and numbers  in a meaningful way to name a vector (never start with a number or a special character). When typing the name of the object you will get all the contents:

In [25]:
fullnames

In [26]:
ages

In [27]:
female

Each vector is composed of elements with the same type. If you want to access individual elements, you can write:

In [28]:
fullnames[1]

In [29]:
# or
ages[1]

In [30]:
# or
female[1]

You can alter the vector using any of the above mechanisms:

In [32]:
fullnames[1]='Alfred Mayer'
# Then:
fullnames[1]

You can add an element to a vector like this:

In [33]:
elements=c(1,20,3)
elements=c(elements,40) # adding to the same one
elements

You can NOT delete it with NULL:

In [34]:
elements
elements[4]=NULL

ERROR: Error in elements[4] = NULL: replacement has length zero


Just do this:

In [35]:
# by position
elements
elements2=elements[-2] # vector 'without' position 2
elements2

In [36]:
# by value
elements3=elements[elements!=20]
elements3
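
The same two ideas extend to several elements at once. A small sketch, recreating the vector so it runs on its own (_withoutPositions_ and _withoutValues_ are illustrative names):

```r
elements = c(1, 20, 3, 40)

# by position: drop positions 1 and 3
withoutPositions = elements[-c(1, 3)]
withoutPositions   # 20 40

# by value: drop every element that appears in a set of values
withoutValues = elements[!elements %in% c(20, 40)]
withoutValues      # 1 3
```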

You can get rid of those vectors using:

In [37]:
rm(elements2)
elements2

ERROR: Error in eval(expr, envir, enclos): object 'elements2' not found


Another operation is getting rid of repeated values; R will not complain if they exist:

In [38]:
weekdays=c('M','T','W','Th','S','Su','Su')
weekdays

Then, use the function _unique_:

In [39]:
unique(weekdays)
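
Before dropping repeated values, you may want to know which ones are repeated. A possible sketch uses _duplicated_ and _table_ (the vector is recreated here under the name _days_, my choice, so the example is self-contained):

```r
days = c('M', 'T', 'W', 'Th', 'S', 'Su', 'Su')

# TRUE marks the second (and later) appearance of a value
repeated = days[duplicated(days)]
repeated        # "Su"

# how many times each value appears
counts = table(days)
counts[['Su']]  # 2
```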

Vector elements can have 'names', but their contents still need to be homogeneous:

In [40]:
newAges=c("Sam"=50, "Paul"=30, "Jim"="40")
newAges

As you see above, writing Jim's value as text ("40") *coerced* the other values to *characters* (the _numbers_ are now _text_; the quotation marks show that). Updating that value will not change the vector type:

In [41]:
newAges["Jim"]=20
newAges

Updating the value does not undo the initial coercion.

Then, you could tell explicitly to change the _mode_ of the vector:

In [42]:
storage.mode(newAges)

In [43]:
storage.mode(newAges)='double' # or integer
newAges

The more familiar function _as.numeric_ can be used, but that will also delete the element names:

In [44]:
newAges=as.numeric(newAges)
newAges

Notice that _as.numeric_ coerces text into missing values, if the text is not a number:

In [45]:
someData1=c(1,2,3,'4')
as.numeric(someData1)

But,

In [46]:
someData2=c(1,2,3,'O') # O not 0
as.numeric(someData2)

“NAs introduced by coercion”

You can use the **is.na** function to find out whether the coercion produced missing values:

In [47]:
is.na(as.numeric(someData2))

“NAs introduced by coercion”
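
Going one step further, you can locate the entries that failed to convert. This is a sketch with a hypothetical vector, _someData_; _suppressWarnings_ is used only to keep the output clean:

```r
someData = c('1', '2', 'O', '4', 'x')   # 'O' is a letter, not zero

# convert, silencing the coercion warning
asNumbers = suppressWarnings(as.numeric(someData))

# positions that became NA although the original was not missing
bad = which(is.na(asNumbers) & !is.na(someData))
bad             # 3 5
someData[bad]   # "O" "x"
```

Knowing the positions lets you fix the original text before converting again.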

### Vectors versus Lists

Let me share some ideas for comparing these two basic structures:

__A) Make sure what you have:__

The functions **is.vector**, **is.list**, **is.character** and **is.numeric** should be used frequently, because we need to be sure of what structure we are dealing with:


In [48]:
aList=list(1,2,3)
aVector=c(1,2,3)

is.vector(aVector); is.list(aVector)

In [49]:
# then:
is.vector(aList,mode='vector'); is.list(aList)
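
Keep in mind that, with its default arguments, _is.vector_ also returns TRUE for a list, because R treats lists as a generic kind of vector. If you need to tell the two apart, _is.atomic_ may be a clearer check. A minimal sketch:

```r
aList = list(1, 2, 3)
aVector = c(1, 2, 3)

is.vector(aList)    # TRUE: a list is a generic vector
is.atomic(aList)    # FALSE: its elements may be of different types
is.atomic(aVector)  # TRUE: all elements share one type
is.list(aVector)    # FALSE
```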

The function **str** could be another alternative to find out what we have:


In [50]:
str(aVector)

 num [1:3] 1 2 3


In [51]:
str(aList)

List of 3
 $ : num 1
 $ : num 2
 $ : num 3


__B) Arithmetic:__

You will find great differences when doing arithmetic:

In [52]:
# if we have these vectors:
numbers1=c(1,2,3)
numbers2=c(10,20,30)
numbers3=c(5)
numbers4=c(1,10)

Then, these work well:

In [53]:
# adding element by element:
numbers1+numbers2

In [54]:
# adding 5  to all the elements of other vector:
numbers2+numbers3

In [55]:
# multiplication (element by element):
numbers1*numbers2

In [56]:
# and this kind of multiplication:
numbers1 * numbers3

However, R will give another warning here:

In [57]:
numbers1+numbers4 # different size matters!

“longer object length is not a multiple of shorter object length”

Comparisons make sense:

In [58]:
numbers1>numbers2

In [59]:
# but:
numbers1>numbers4

“longer object length is not a multiple of shorter object length”
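
The warnings come from R's _recycling_ rule: the shorter vector is repeated until it matches the longer one. When the longer length is a multiple of the shorter one, R recycles without any warning. A sketch with two hypothetical vectors (_long_ and _short_):

```r
long = c(1, 2, 3, 4)
short = c(10, 20)

# 'short' is used as 10, 20, 10, 20
recycled = long + short
recycled   # 11 22 13 24
```

Since this happens silently, it is worth checking the lengths with _length_ before combining vectors.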

Now, let's see how the previous operations work here. These are our lists:

In [60]:
numbersL1=list(11,22,33)
numbersL2=list(1,2,3)

...the _adding_ cannot be interpreted:

In [61]:
numbersL1+numbersL2

ERROR: Error in numbersL1 + numbersL2: non-numeric argument to binary operator


... and neither the comparisons...

In [62]:
numbersL1>numbersL2

ERROR: Error in numbersL1 > numbersL2: comparison of these types is not implemented


So do not expect either of these to work:

In [63]:
numbersL1*numbersL2

ERROR: Error in numbersL1 * numbersL2: non-numeric argument to binary operator


In [64]:
numbersL1*3

ERROR: Error in numbersL1 * 3: non-numeric argument to binary operator
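
If you do need arithmetic on lists whose elements are all numbers, one option is to turn them into vectors with _unlist_, or to apply the operation element by element with _Map_ or _lapply_. The following is a sketch, not the only way to do it:

```r
numbersL1 = list(11, 22, 33)
numbersL2 = list(1, 2, 3)

# option 1: convert to vectors first (the result is a vector)
sums = unlist(numbersL1) + unlist(numbersL2)
sums                 # 12 24 36

# option 2: add pair by pair (the result is a list)
sumsList = Map(`+`, numbersL1, numbersL2)
unlist(sumsList)     # 12 24 36

# multiply every element by 3 (the result is a list)
tripled = lapply(numbersL1, function(x) x * 3)
unlist(tripled)      # 33 66 99
```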


[Go to page beginning](#beginning)

----

<a id='part3'></a>

## Data Frames

Data frames are containers of values. You use a data frame because you need to combine what vectors and lists do. The most common analogy is a data table like  the ones in a __spreadsheet__: 


In [65]:
# VECTORS
names=c("Qing", "Françoise", "Raúl", "Bjork")
ages=c(32,33,28,30)
country=c("China", "Senegal", "Spain", "Norway")
education=c("Bach", "Bach", "Master", "PhD")

#DF as a "List" of vectors:
students=data.frame(names,ages,country,education)
students

names,ages,country,education
Qing,32,China,Bach
Françoise,33,Senegal,Bach
Raúl,28,Spain,Master
Bjork,30,Norway,PhD


You see your data frame above. Just by looking at it, you cannot be sure of what you have, so using **str** is highly recommended:

In [66]:
str(students)

'data.frame':	4 obs. of  4 variables:
 $ names    : Factor w/ 4 levels "Bjork","Françoise",..: 3 2 4 1
 $ ages     : num  32 33 28 30
 $ country  : Factor w/ 4 levels "China","Norway",..: 1 3 4 2
 $ education: Factor w/ 3 levels "Bach","Master",..: 1 1 2 3


This data frame has four columns; the vector 'names' is an ordinary column, while the __row names__ are the default positions 1 to 4.

In R versions before 4.0.0, _data.frame_ turns text vectors into factors (categorical values) by default, as in the output above. You can avoid that by writing:

In [67]:
students=data.frame(names,ages,country,education,
                    stringsAsFactors=FALSE)
str(students)

'data.frame':	4 obs. of  4 variables:
 $ names    : chr  "Qing" "Françoise" "Raúl" "Bjork"
 $ ages     : num  32 33 28 30
 $ country  : chr  "China" "Senegal" "Spain" "Norway"
 $ education: chr  "Bach" "Bach" "Master" "PhD"


The function _str_ showed you the dimensions of the structure (number of rows and columns); R has alternative ways to get the dimensions:

In [68]:
dim(students)

In [69]:
#also
nrow(students) ; ncol(students) 

In [70]:
# and very important:
length(students)

We know _length_ works for vectors and lists. In data frames, it gives you number of columns, NOT rows. 
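
The reason is that a data frame is, internally, a list of columns. A small sketch to verify it, recreating the _students_ data frame so the example runs on its own:

```r
students = data.frame(names = c("Qing", "Françoise", "Raúl", "Bjork"),
                      ages = c(32, 33, 28, 30),
                      country = c("China", "Senegal", "Spain", "Norway"),
                      education = c("Bach", "Bach", "Master", "PhD"),
                      stringsAsFactors = FALSE)

is.list(students)                    # TRUE
length(students) == ncol(students)   # TRUE

# so list-style access works for columns
students[["ages"]]                   # 32 33 28 30

# and you can ask for the type of every column
columnTypes = sapply(students, class)
columnTypes
```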

Data frames have the function __head()__, which is very useful to show the top rows of the data frame:

In [73]:
head(students,2) # top 2

names,ages,country,education
Qing,32,China,Bach
Françoise,33,Senegal,Bach


Of course, we have __tail__:

In [72]:
tail(students,2) # last 2

Unnamed: 0,names,ages,country,education
3,Raúl,28,Spain,Master
4,Bjork,30,Norway,PhD


You can access data frame elements in an easy way:

In [74]:
# one particular column
students$names

In [75]:
# two columns using positions
students[,c(1,4)]

names,education
Qing,Bach
Françoise,Bach
Raúl,Master
Bjork,PhD


In [76]:
## two columns using names of columns
students[,c('names','education')]

names,education
Qing,Bach
Françoise,Bach
Raúl,Master
Bjork,PhD


Positions are handy when you need several adjacent columns:

In [77]:
students[,c(1,3:4)] # ':' is used to facilitate 'from-to' sequence

names,country,education
Qing,China,Bach
Françoise,Senegal,Bach
Raúl,Spain,Master
Bjork,Norway,PhD


Of course, you can create a new object with **subsets**:

In [78]:
studentsNoEd=students[,c(1:3)]
studentsNoEd

names,ages,country
Qing,32,China
Françoise,33,Senegal
Raúl,28,Spain
Bjork,30,Norway
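
You can also build a subset by saying which columns to leave out. The sketch below recreates _students_ and shows two possibilities; the object names _dropByPosition_ and _dropByName_ are illustrative:

```r
students = data.frame(names = c("Qing", "Françoise", "Raúl", "Bjork"),
                      ages = c(32, 33, 28, 30),
                      country = c("China", "Senegal", "Spain", "Norway"),
                      education = c("Bach", "Bach", "Master", "PhD"),
                      stringsAsFactors = FALSE)

# by position: a negative index means 'without'
dropByPosition = students[, -4]

# by name: keep every column name except 'education'
keep = setdiff(names(students), "education")
dropByName = students[, keep]

names(dropByName)   # "names" "ages" "country"
```

Dropping by name keeps working if the column order changes later.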


You have a _summary_ function:

In [79]:
summary(students)

    names                ages         country           education        
 Length:4           Min.   :28.00   Length:4           Length:4          
 Class :character   1st Qu.:29.50   Class :character   Class :character  
 Mode  :character   Median :31.00   Mode  :character   Mode  :character  
                    Mean   :30.75                                        
                    3rd Qu.:32.25                                        
                    Max.   :33.00                                        

If you had the categorical values as factors, you could get a frequency table:

In [80]:
students$country=as.factor(students$country)
students$education=as.factor(students$education)

Then,

In [81]:
summary(students)

    names                ages          country   education
 Length:4           Min.   :28.00   China  :1   Bach  :2  
 Class :character   1st Qu.:29.50   Norway :1   Master:1  
 Mode  :character   Median :31.00   Senegal:1   PhD   :1  
                    Mean   :30.75   Spain  :1             
                    3rd Qu.:32.25                         
                    Max.   :33.00                         
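
If you only need the frequencies of one categorical column, _table_ gives them directly, and _prop.table_ turns them into proportions. A minimal sketch, using a stand-alone factor with the same values as the _education_ column:

```r
education = factor(c("Bach", "Bach", "Master", "PhD"))

counts = table(education)
counts                # Bach 2, Master 1, PhD 1

shares = prop.table(counts)
shares[["Bach"]]      # 0.5

levels(education)     # "Bach" "Master" "PhD"
```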

You can modify any values in a data frame. Let me create a copy of this data frame to play with:


In [82]:
studentsCopy=students # I make a copy to avoid altering my original dataframe

Now, I can change the age of Qing to 23 replacing 32:

In [83]:
studentsCopy[1,2]=23
# change is immediate! (you will not get any warning)
studentsCopy[1,]

names,ages,country,education
Qing,23,China,Bach


We can set a column as **missing**:

In [84]:
studentsCopy$country=NA

In [85]:
studentsCopy

names,ages,country,education
Qing,23,,Bach
Françoise,33,,Bach
Raúl,28,,Master
Bjork,30,,PhD


And, delete a column by **null**ing it:

In [86]:
studentsCopy$ages=NULL

In [87]:
studentsCopy

names,country,education
Qing,,Bach
Françoise,,Bach
Raúl,,Master
Bjork,,PhD


### Querying Data Frames:

Once you have a data frame you can start writing interesting queries (notice the use of _commas_):

**Who is the oldest in the group?**

In [88]:
students[which.max(students$ages),] 

Unnamed: 0,names,ages,country,education
2,Françoise,33,Senegal,Bach


**Who is the youngest in the group?**

In [89]:
students[which.min(students$ages),] 

Unnamed: 0,names,ages,country,education
3,Raúl,28,Spain,Master


**Who is above 30 and from China?**

In [90]:
students[students$ages>30 & students$country=='China',] 

names,ages,country,education
Qing,32,China,Bach


**Who is not from Norway?**

In [91]:
students[students$country!="Norway",] 

names,ages,country,education
Qing,32,China,Bach
Françoise,33,Senegal,Bach
Raúl,28,Spain,Master


**Who is from one of these places?**

In [92]:
Places=c("Peru", "USA", "Spain")
students[students$country %in% Places,] 

Unnamed: 0,names,ages,country,education
3,Raúl,28,Spain,Master


In [93]:
# the opposite
students[!students$country %in% Places,] 

Unnamed: 0,names,ages,country,education
1,Qing,32,China,Bach
2,Françoise,33,Senegal,Bach
4,Bjork,30,Norway,PhD


**What is the education level of the one who is above 30 years old and from China?**

In [94]:
students[students$ages>30 & students$country=='China',]$education 
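
The bracket queries above can also be written with the base R function _subset_, which lets you use the column names without repeating `students$`. This is a sketch of an alternative, not a replacement; _students_ is recreated so the example runs on its own:

```r
students = data.frame(names = c("Qing", "Françoise", "Raúl", "Bjork"),
                      ages = c(32, 33, 28, 30),
                      country = c("China", "Senegal", "Spain", "Norway"),
                      education = c("Bach", "Bach", "Master", "PhD"),
                      stringsAsFactors = FALSE)

# rows that satisfy both conditions
olderFromChina = subset(students, ages > 30 & country == "China")
olderFromChina

# same rows, keeping only one column (the result is still a data frame)
theirEducation = subset(students, ages > 30 & country == "China",
                        select = education)
theirEducation$education   # "Bach"
```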

**Show me the data ordered by age (decreasing):**

In [97]:
students[order(-students$ages),]

Unnamed: 0,names,ages,country,education
2,Françoise,33,Senegal,Bach
1,Qing,32,China,Bach
4,Bjork,30,Norway,PhD
3,Raúl,28,Spain,Master
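
The function _order_ also accepts the argument _decreasing_ and more than one key. The sketch below recreates _students_ and orders it first by age alone, then by education with age (decreasing) breaking ties; _byAge_ and _byEducation_ are illustrative names:

```r
students = data.frame(names = c("Qing", "Françoise", "Raúl", "Bjork"),
                      ages = c(32, 33, 28, 30),
                      country = c("China", "Senegal", "Spain", "Norway"),
                      education = c("Bach", "Bach", "Master", "PhD"),
                      stringsAsFactors = FALSE)

# oldest first
byAge = students[order(students$ages, decreasing = TRUE), ]
rownames(byAge)         # "2" "1" "4" "3"

# alphabetically by education; within the same education, oldest first
byEducation = students[order(students$education, -students$ages), ]
rownames(byEducation)   # "2" "1" "3" "4"
```

The row names keep the original positions, so you can see where each row came from.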



----

* [Go to page beginning](#beginning)
* [Go to Course schedule](https://ds4ps.org/ddmp-uw-class-spring-2019/schedule/)