<br> 
<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>


## Course: Data-Driven Management and Policy

### Prof. José Manuel Magallanes, PhD 

_____

# Session 3: Data Structures





<a id='beginning'></a>

Programming languages use _data structures_ to tell the computer how to organize the data we are working with. The data structures provided by one programming language are not necessarily the same as those provided by another. However, in most cases, the name given to a data structure in one language is the same in others. It is worth keeping in mind that a particular data structure may serve one purpose well, but not others.

In everyday life, a book can be considered a data structure: we use it to store some kind of information. It has some advantages: it has a table of contents; it has numbers on the pages; you can take it with you; read it as long as you can see the words; and read it again as many times as you want. It has some disadvantages: you can lose it, and need to buy it again; it can deteriorate; get eaten by an insect; and so on. 

We are going to talk about 3 data structures in R:

1. [Lists.](#part1) 
2. [Vectors.](#part2) 
3. [Data Frame.](#part3) 

**Lists** and **vectors** are simple structures; a **data frame** is a more complex one (built from the simple ones). 

----

<a id='part1'></a>

## List

Lists are containers of values. The values can be of any kind (numbers or non-numbers), and even other containers (simple or complex). If we take a spreadsheet as a reference, a row is a 'natural' list.


In [87]:
DetailStudent=list("Fred Meyers",
                   40,
                   FALSE)

The *object* 'DetailStudent' _temporarily_ stores the list in the computer's memory. To name an object, use combinations of letters and numbers (do not start with a number or a special character) in a meaningful way.

Typing the name of the object 'DetailStudent', now representing a list, will give you all the contents you saved in there:


In [88]:
DetailStudent

The list above has three elements. However, you may be wondering what those elements mean when taken together. In those situations, it is better to have a name for each element.


In [89]:
DetailStudent=list(fullName="Fred Meyers",
                   age=40,
                   female=FALSE)

In [90]:
# seeing the result
DetailStudent

This list has three elements, which we can also call _fields_. Each of these, in this case, holds a different data type:

* *fullName* holds characters
* *age* holds a number
* *female* holds a logical (Boolean) value.
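
A quick way to confirm what each field holds is to ask R directly. The sketch below rebuilds the same list under an illustrative name (`student` is an assumption, not an object used elsewhere in this session) and relies only on the base functions `names`, `length`, `sapply` and `class`:

```r
# illustrative copy of the list above
student=list(fullName="Fred Meyers",
             age=40,
             female=FALSE)

names(student)    # the field names
length(student)   # the number of fields

# data type held by each field
fieldTypes=sapply(student,class)
fieldTypes
```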

You can access any of those elements using these approaches:

In [91]:
# position
DetailStudent[[1]]

In [94]:
# name of the field
DetailStudent[['fullName']]

In [93]:
# name of the field
DetailStudent$fullName

If you do not have _names_ for the list fields, you can only access them using positions:

In [12]:
NewList=list('a','b','c','d',1,2,3)
NewList[[1]]
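
Note the double brackets. With single brackets, R returns a smaller *list* instead of the value itself. A minimal sketch, using a made-up list named `demo`:

```r
demo=list('a','b',1,2)

inside=demo[[1]]   # the value stored in position 1
wrapped=demo[1]    # a list of length 1 that contains that value

class(inside)
class(wrapped)
```

The distinction matters later on: a function that expects a character or a number will usually not accept the wrapped version.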

Once you access an element, you can alter it:

In [18]:
DetailStudent[[1]]='Alfred Mayer'
# Then:
DetailStudent

You can even add a totally NEW field like this:

In [20]:
DetailStudent$city='Seattle'

# show:
DetailStudent

And destroy it by **NULL**ing it, like this:

In [12]:
DetailStudent$city=NULL # you could also use: DetailStudent[[4]]=NULL
DetailStudent

You can get rid of a list using:

In [52]:
rm(DetailStudent)
DetailStudent

ERROR: Error in eval(expr, envir, enclos): object 'DetailStudent' not found


**How would you create a list for this person from his personal information?**


<img src="listExample.png" alt="Drawing" style="width: 200px;"/>

In [1]:
cr7=list('FullName'='Cristiano Ronaldo dos Santos Aveiro', 
         'DateOfBirth'='5 February 1985',
         'PlaceOfBirth'='Funchal, Madeira, Portugal',
         'HeightInMeters'=1.89,
         'PlayingPosition'='Forward'
        )

In [2]:
#seeing the result:
cr7

There is nothing wrong with the previous list. But keep in mind that we save data to retrieve it and act (decide) upon its value. For example, can we answer this question?

* What is Ronaldo's playing position?

In [3]:
cr7$PlayingPosition

Great! However, we cannot directly answer:

* How old is he?



In [4]:
"15 April 2019" - cr7$DateOfBirth

ERROR: Error in "15 April 2019" - cr7$DateOfBirth: non-numeric argument to binary operator


We failed. 
In this situation, we should consider updating the field type:

In [33]:
# make sure month names are read in English
# ("English" works on Windows; on macOS/Linux you may need something like "en_US.UTF-8")

Sys.setlocale("LC_TIME", "English")

In [34]:
cr7$DateOfBirth=as.Date('5 February 1985',format="%d %B %Y")

In [36]:
# then

Sys.Date()-cr7$DateOfBirth

Time difference of 12473 days

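That is his age in days. To get years, you will need to divide by the average number of days in a year (R still labels the result as 'days', but the value is now in years):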

In [49]:
# Then
(Sys.Date()-cr7$DateOfBirth)/365.2425

Time difference of 34.14991 days

In [48]:
library(lubridate)
time_length(interval(cr7$DateOfBirth,Sys.Date()),"years")

In [42]:
?time_length
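
The steps above depend on today's date and on the computer's language settings. As a reproducible sketch, the example below uses only base R, writes both dates in the year-month-day format (so no locale change is needed), and fixes the reference date at 15 April 2019. The object names are illustrative:

```r
birth=as.Date("1985-02-05")      # the default format is %Y-%m-%d
reference=as.Date("2019-04-15")  # a fixed date, instead of Sys.Date()

# ask for a plain number of days, not a 'difftime' object
daysLived=as.numeric(difftime(reference,birth,units="days"))
daysLived

# approximate age in years, and completed years
ageYears=daysLived/365.2425
floor(ageYears)
```

Converting to a plain number with `as.numeric` avoids the misleading 'days' label after dividing.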

[Go to page beginning](#beginning)

----

<a id='part2'></a> 

## Vectors
Vectors are also containers of values. The values should be of only __one__ type (otherwise, __R__ may silently alter or _coerce_ them). If we take a spreadsheet as a reference, a column can be a natural vector (when every value in the spreadsheet has to be of the same type, you get a "bidimensional vector", which is known as a **matrix**<sup><a href="#fn1" id="ref1">1</a></sup>).

Here, we will create three vectors using the "**c(...)**" function: 

In [29]:
fullnames=c("Fred Meyers","Sarah Jones", "Lou Ferrigno","Sky Turner")
ages=c(40,35, 60,77)
female=c(F,T,T,T)

Each *object* temporarily holds a vector. To name a vector, use combinations of letters and numbers (never start with a number or a special character) in a meaningful way. Typing the name of the object gives you all its contents:

In [30]:
fullnames

In [31]:
ages

In [32]:
female

Each vector is composed of elements with the same type. If you want to access individual elements, you can write:

In [33]:
fullnames[1]

In [34]:
# or
ages[1]

In [35]:
# or
female[1]

You can alter an element of a vector once you access it by position:

In [36]:
fullnames[1]='Alfred Mayer'
# Then:
fullnames[1]

You can add an element to a vector like this:

In [44]:
elements=c(1,20,3)
elements=c(elements,40) # adding to the same one
elements

You can NOT delete it with NULL:

In [45]:
elements
elements[4]=NULL

ERROR: Error in elements[4] = NULL: replacement has length zero


Just do this:

In [48]:
# by position
elements
elements2=elements[-2] # vector 'without' position 2
elements2

In [47]:
# by value
elements3=elements[elements!=20]
elements3

You can get rid of those vectors using:

In [51]:
rm(elements2)
elements2

ERROR: Error in eval(expr, envir, enclos): object 'elements2' not found


Another important operation is to get rid of repeated values:

In [24]:
weekdays=c('M','T','W','Th','S','Su','Su')
weekdays

Then, use the function _unique_:

In [25]:
unique(weekdays)

Vector elements can have 'names', but their contents still need to be homogeneous:

In [56]:
newAges=c("Sam"=50, "Paul"=30, "Name"="Jim")
newAges

As you see above, the presence of "Jim" as an element *coerced* the other values to *characters* (the _numbers_ are now _text_; the quotation marks around them show that). Eliminating that value keeps the values numeric in the vector:

In [54]:
newAgesGood=c("Sam"=50, "Paul"=30)
newAgesGood
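
When in doubt, `class` reports the type R ended up with after combining the values. A small sketch with made-up vectors:

```r
mixed=c(50,30,"Jim")     # one text value turns everything into text
onlyNumbers=c(50,30)
withLogical=c(50,TRUE)   # TRUE is coerced to the number 1

class(mixed)
class(onlyNumbers)
class(withLogical)

# text that looks like a number can be converted back
recovered=as.numeric(mixed[1:2])
recovered
```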

### Vectors versus Lists

Let me share some ideas for comparing these two basic structures:

__A) Make sure what you have:__

The functions **is.vector**, **is.list**, **is.character** and **is.numeric** should be used frequently, because we need to be sure of what structure we are dealing with:


In [62]:
aList=list(1,2,3)
aVector=c(1,2,3)

is.vector(aVector); is.list(aVector)

In [65]:
# then:
is.vector(aList,mode='vector'); is.list(aList)
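
Keep in mind that R treats a list as a generic kind of vector, so **is.vector** with its default arguments is also TRUE for a list. One way to tell the two apart, sketched below with base R only and illustrative object names, is **is.atomic**, which is TRUE for vectors such as those built with `c(...)` and FALSE for lists:

```r
someList=list(1,2,3)
someVector=c(1,2,3)

is.vector(someList)     # a list counts as a (generic) vector
is.atomic(someList)     # but it is not an atomic vector
is.atomic(someVector)   # a vector built with c(...) is atomic
```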

The function **str** could be another alternative to find out what we have:


In [66]:
str(aVector)

 num [1:3] 1 2 3


In [67]:
str(aList)

List of 3
 $ : num 1
 $ : num 2
 $ : num 3


__B) Arithmetic:__

You will find great differences when doing arithmetic:

In [68]:
# if we have these vectors:
numbers1=c(1,2,3)
numbers2=c(10,20,30)
numbers3=c(5)
numbers4=c(1,10)

Then, these work well:

In [70]:
# adding element by element:
numbers1+numbers2

In [75]:
# adding 5  to all the elements of other vector:
numbers2+numbers3

In [72]:
# multiplication (element by element):
numbers1*numbers2

In [73]:
# and this kind of multiplication:
numbers1 * numbers3

However, R will give a warning here:

In [76]:
numbers1+numbers4 # different size matters!

“longer object length is not a multiple of shorter object length”

Comparisons make sense:

In [83]:
numbers1>numbers2

In [86]:
# but:
numbers1>numbers4

“longer object length is not a multiple of shorter object length”
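
The warning appears because R *recycles* the shorter vector, repeating its values until it matches the length of the longer one. When the longer length is an exact multiple of the shorter one, the recycling is silent. A sketch with illustrative vectors:

```r
four=c(1,2,3,4)
two=c(10,20)

# 'two' is reused as 10,20,10,20 and no warning is given
recycled=four+two
recycled
```

Silent recycling is convenient (it is what makes adding a single number to a vector work), but it can also hide a mistake, so checking `length` before combining vectors is a reasonable habit.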

Now, let's see how the previous operations work here. These are our lists:

In [77]:
numbersL1=list(11,22,33)
numbersL2=list(1,2,3)

...the _adding_ cannot be interpreted:

In [78]:
numbersL1+numbersL2

ERROR: Error in numbersL1 + numbersL2: non-numeric argument to binary operator


... and neither the comparisons...

In [79]:
numbersL1>numbersL2

ERROR: Error in numbersL1 > numbersL2: comparison of these types is not implemented


So do not expect either of these to work:

In [80]:
numbersL1*numbersL2

ERROR: Error in numbersL1 * numbersL2: non-numeric argument to binary operator


In [81]:
numbersL1*3

ERROR: Error in numbersL1 * 3: non-numeric argument to binary operator
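
If the values live in lists, one option is to turn the lists into vectors with `unlist` before doing the arithmetic; another is to apply the operation element by element with `lapply` or `Map`, which return lists. A minimal sketch (the list names are illustrative):

```r
listA=list(11,22,33)
listB=list(1,2,3)

# option 1: convert to vectors first
sumAsVector=unlist(listA)+unlist(listB)
sumAsVector

# option 2: operate on each element, keeping a list
tripled=lapply(listA,function(x) x*3)
sumAsList=Map(function(x,y) x+y,listA,listB)
str(sumAsList)
```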


[Go to page beginning](#beginning)

----

<a id='part3'></a>

## Data Frames

Data frames are also containers of values. You use a data frame when you need to combine what vectors and lists do: each column is a vector, and the columns are held together like the elements of a list. The most common analogy is a data table like the ones in a __spreadsheet__:


In [54]:
# VECTORS
names=c("Qing", "Françoise", "Raúl", "Bjork")
ages=c(32,33,28,30)
country=c("China", "Senegal", "Spain", "Norway")
education=c("Bach", "Bach", "Master", "PhD")

#DF as a "List" of vectors:
students=data.frame(ages,country,education,row.names=names)
students

| |ages|country|education|
|---|---|---|---|
|Qing|32|China|Bach|
|Françoise|33|Senegal|Bach|
|Raúl|28|Spain|Master|
|Bjork|30|Norway|PhD|


You see your data frame above. Just by looking at it, you cannot be sure of what you have, so using **str** is highly recommended:

In [55]:
str(students)

'data.frame':	4 obs. of  3 variables:
 $ ages     : num  32 33 28 30
 $ country  : Factor w/ 4 levels "China","Norway",..: 1 3 4 2
 $ education: Factor w/ 3 levels "Bach","Master",..: 1 1 2 3


This data frame uses the vector 'names' as the __row names__, so that vector is not considered a column. That is fine, and it lets you select a row by position or by name:

In [57]:
students[1,] #first row
students['Qing',] # row with 'Qing' as row name


| |ages|country|education|
|---|---|---|---|
|Qing|32|China|Bach|


| |ages|country|education|
|---|---|---|---|
|Qing|32|China|Bach|


In [58]:
# this is wrong: 
students['Qing']

ERROR: Error in `[.data.frame`(students, "Qing"): undefined columns selected
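
The error is a reminder that a data frame behaves like a list of columns: a single index with no comma selects *columns*, not rows. A sketch with a small made-up data frame (`people` and its values are hypothetical):

```r
people=data.frame(ages=c(32,28),
                  country=c("China","Spain"),
                  row.names=c("Ana","Bo"),
                  stringsAsFactors=FALSE)

is.list(people)            # each column is an element of a list

asColumn=people['ages']    # still a data frame, with one column
asVector=people[['ages']]  # the vector stored in that column
oneRow=people['Ana',]      # the comma makes 'Ana' a row name
oneRow
```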


But the problem you should have detected is that country and education are of type *factor*; that is, R has coerced them into **categorical variables**. This was the default behavior of `data.frame` before R 4.0.0; since that version, text columns are kept as characters by default. If you do not want factors, because these are proper names, you should create your data frame requesting that explicitly:

In [59]:
students=data.frame(names,ages,country,education,
                    stringsAsFactors=F)
str(students)

'data.frame':	4 obs. of  4 variables:
 $ names    : chr  "Qing" "Françoise" "Raúl" "Bjork"
 $ ages     : num  32 33 28 30
 $ country  : chr  "China" "Senegal" "Spain" "Norway"
 $ education: chr  "Bach" "Bach" "Master" "PhD"
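
Separately, if a text column has already been stored as a factor, it can be inspected with `levels` and converted back with `as.character`. A short sketch using a stand-alone factor (the object names are illustrative):

```r
educationFactor=factor(c("Bach","Bach","Master","PhD"))

levels(educationFactor)       # the distinct categories
as.integer(educationFactor)   # the codes R stores internally

# back to plain text
educationText=as.character(educationFactor)
class(educationText)
```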


Notice that in this new version, I am considering *names* as a column and not as the row names; then, R will use numbers in each row by default:

In [60]:
students

|names|ages|country|education|
|---|---|---|---|
|Qing|32|China|Bach|
|Françoise|33|Senegal|Bach|
|Raúl|28|Spain|Master|
|Bjork|30|Norway|PhD|


The function _str_ showed you the dimensions of the structure (number of rows and columns); R has alternative ways to get the dimensions:

In [61]:
dim(students)

#also
nrow(students)  # we have ncol() too!

# and very important:
length(students)

We know _length_ works for vectors and lists. In data frames, it gives you the number of columns, NOT rows. Data frames also have the function __head()__, which is very useful to show the top rows of the data frame:

In [62]:
head(students,2) # top 2

|names|ages|country|education|
|---|---|---|---|
|Qing|32|China|Bach|
|Françoise|33|Senegal|Bach|


Of course, we have __tail__:

In [64]:
tail(students,2) # last 2

| |names|ages|country|education|
|---|---|---|---|---|
|3|Raúl|28|Spain|Master|
|4|Bjork|30|Norway|PhD|


You can access data frame elements in an easy way:

In [65]:
# one particular column
students$names
# two columns using positions
students[,c(1,4)]
# two columns using names of columns
students[,c('names','education')]

|names|education|
|---|---|
|Qing|Bach|
|Françoise|Bach|
|Raúl|Master|
|Bjork|PhD|


|names|education|
|---|---|
|Qing|Bach|
|Françoise|Bach|
|Raúl|Master|
|Bjork|PhD|


Using positions is a convenient way to get several consecutive columns:

In [66]:
students[,c(1:3)] # ':' is used to facilitate 'from-to' sequence

|names|ages|country|
|---|---|---|
|Qing|32|China|
|Françoise|33|Senegal|
|Raúl|28|Spain|
|Bjork|30|Norway|


Of course, you can create a new object with **subsets**:

In [67]:
studentsNoEd=students[,c(1:3)]
studentsNoEd

|names|ages|country|
|---|---|---|
|Qing|32|China|
|Françoise|33|Senegal|
|Raúl|28|Spain|
|Bjork|30|Norway|


You can modify any values in a data frame. Let me create a copy of this data frame to play with:


In [68]:
studentsCopy=students # I make a copy to avoid altering my original dataframe

Now, I can change the age of Qing to 23 replacing 32:

In [69]:
studentsCopy[1,2]=23
# change is immediate! (you will not get any warning)
studentsCopy[1,]

|names|ages|country|education|
|---|---|---|---|
|Qing|23|China|Bach|


We can set a column as **missing**:

In [70]:
studentsCopy$country=NA

In [71]:
studentsCopy

|names|ages|country|education|
|---|---|---|---|
|Qing|23|NA|Bach|
|Françoise|33|NA|Bach|
|Raúl|28|NA|Master|
|Bjork|30|NA|PhD|


And, delete a column by **null**ing it:

In [72]:
studentsCopy$ages=NULL

In [73]:
studentsCopy

|names|country|education|
|---|---|---|
|Qing|NA|Bach|
|Françoise|NA|Bach|
|Raúl|NA|Master|
|Bjork|NA|PhD|
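
New columns can be added in the same way existing ones are replaced: assign a vector (or a single value) to a column name that does not exist yet. A sketch with a hypothetical data frame named `team`:

```r
team=data.frame(names=c("Ana","Bo","Cy"),
                ages=c(32,28,41),
                stringsAsFactors=FALSE)

# new column computed from an existing one
team$over30=team$ages>30

# new column with the same value in every row
team$cohort=2019

team
```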


### Querying Data Frames:

Once you have a data frame you can start writing interesting queries (notice the use of _commas_):

**Who is the oldest in the group?**

In [74]:
students[which.max(students$ages),] 

| |names|ages|country|education|
|---|---|---|---|---|
|2|Françoise|33|Senegal|Bach|


**Who is the youngest in the group?**

In [75]:
students[which.min(students$ages),] 

| |names|ages|country|education|
|---|---|---|---|---|
|3|Raúl|28|Spain|Master|


**Who is above 30 and from China?**

In [76]:
students[students$ages>30 & students$country=='China',] 

|names|ages|country|education|
|---|---|---|---|
|Qing|32|China|Bach|


**Who is not from Norway?**

In [77]:
students[students$country!="Norway",] 

|names|ages|country|education|
|---|---|---|---|
|Qing|32|China|Bach|
|Françoise|33|Senegal|Bach|
|Raúl|28|Spain|Master|


**Who is from one of these?**

In [79]:
Places=c("Peru", "USA", "Spain")
students[students$country %in% Places,] 

| |names|ages|country|education|
|---|---|---|---|---|
|3|Raúl|28|Spain|Master|


In [80]:
# the opposite
students[!students$country %in% Places,] 

| |names|ages|country|education|
|---|---|---|---|---|
|1|Qing|32|China|Bach|
|2|Françoise|33|Senegal|Bach|
|4|Bjork|30|Norway|PhD|


**What is the education level of the one above 30 and from China?**

In [81]:
students[students$ages>30 & students$country=='China',]$education 
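
Base R also offers `subset`, which lets you write the condition using the column names directly and pick columns with `select`. The sketch below builds a similar data frame under the hypothetical name `group` (with made-up names) so that it runs on its own:

```r
group=data.frame(names=c("Ana","Bo","Cy","Di"),
                 ages=c(32,33,28,30),
                 country=c("China","Senegal","Spain","Norway"),
                 education=c("Bach","Bach","Master","PhD"),
                 stringsAsFactors=FALSE)

# rows above 30 and from China, keeping only two columns
chosen=subset(group,ages>30 & country=='China',
              select=c(names,education))
chosen
```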

**Show me the data ordered by age (decreasing).**

In [82]:

students[order(-students$ages),]



| |names|ages|country|education|
|---|---|---|---|---|
|2|Françoise|33|Senegal|Bach|
|1|Qing|32|China|Bach|
|4|Bjork|30|Norway|PhD|
|3|Raúl|28|Spain|Master|
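
The function `order` can also sort text columns and take more than one key, which helps to break ties. In this sketch (the data frame `staff` is made up), rows are sorted by education alphabetically and, within the same education, by decreasing age:

```r
staff=data.frame(names=c("Ana","Bo","Cy","Di"),
                 ages=c(32,33,28,30),
                 education=c("Bach","Bach","Master","PhD"),
                 stringsAsFactors=FALSE)

# first key: education (increasing); second key: age (decreasing)
sorted=staff[order(staff$education,-staff$ages),]
sorted
```

The minus sign only works on numeric keys; for a decreasing order on a single key of any type, `order` also accepts `decreasing=TRUE`.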



----

* [Go to page beginning](#beginning)
* [Go to Course schedule](https://ds4ps.org/ddmp-uw-class-spring-2019/schedule/)

_____
### Footnotes
<sup id="fn1">1</sup>Vectors can be combined into a matrix, and matrices into arrays. These structures are needed when doing some nice math, but we are not covering them in this course. <a href="#ref1" >&#8593;</a>

