# Week 2

# 2. Getting data into Julia

# Loading data using Julia

<h2>Outcome</h2>

After this lecture, you will be able to
- Find data on the West African EVD epidemic online
- Use readdlm() to load data from a .csv file containing this data



<h2>Wikipedia data on the West African EVD epidemic</h2>

Wikipedia has many excellent articles on Ebola. We will be using one with fairly complete data on the timeline of cases: https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic_timeline_of_reported_cases_and_deaths

Go there now, please, and navigate until you see a table that looks like this:

<img src="Week2_Lecture2_1-Wikipedia-EVD-cases.png" alt="(Screenshot of wikipedia table of WA EVD cases)">



We have provided the data as a file named wikipediaEVDraw.csv. The ".csv" extension indicates that it is a comma-separated file, and the "raw" in the filename indicates that the data are imported as is, without any changes.

If you would like to learn how to create .csv files from tables on the web, please go to the optional lecture "How to export web tables to .csv files".

<h2>Using readdlm to load a .csv file</h2>

Now we can start using Julia again. In a new notebook for your Week 2 Julia code, enter and execute the lines below:

In [9]:
using DelimitedFiles
wikiEVDraw = DelimitedFiles.readdlm("wikipediaEVDraw.csv", ',')  # getting quotes right is important!

54×9 Array{Any,2}:
 "25 Nov 2015"  28637  11314  3804  2536  …  4808     14122     3955
 "18 Nov 2015"  28634  11314  3804  2536     4808     14122     3955
 "11 Nov 2015"  28635  11314  3805  2536     4808     14122     3955
 "4 Nov 2015"   28607  11314  3810  2536     4808     14089     3955
 "25 Oct 2015"  28539  11298  3806  2535     4808     14061     3955
 "18 Oct 2015"  28476  11298  3803  2535  …  4808     14001     3955
 "11 Oct 2015"  28454  11297  3800  2534     4808     13982     3955
 "27 Sep 2015"  28388  11296  3805  2533     4808     13911     3955
 "20 Sep 2015"  28295  11295  3800  2532     4808     13823     3955
 "13 Sep 2015"  28220  11291  3792  2530     4808     13756     3953
 "6 Sep 2015"   28147  11291  3792  2530  …  4808     13683     3953
 "30 Aug 2015"  28073  11290  3792  2529     4808     13609     3953
 "16 Aug 2015"  27952  11284  3786  2524     4808     13494     3952
 ⋮                                        ⋱                     
 "9 Aug 2014"    18

The readdlm() function is Julia's way to read any file that consists of lines separated into data items with a delimiter of some sort. In fact, the very word "readdlm" is an abbreviation of "read-with-a-delimiter".

Notice three things:
- We have used a variable to contain the data from the file (you could change the name, though, if you like)
- The file name is given as a string, using double quotes
- The delimiter is given as a character, using single quotes

Finally, we see that the data, once stored in the variable, form an array whose elements are of type Any. This is not good for computation; in particular, for modelling we need the data in terms of days since the start of the epidemic. Our next job is to convert the strings in column one into integers which give the number of days since 22 March 2014.
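To try readdlm() without the course file, here is a minimal sketch. It assumes a made-up two-row sample in the same layout as the first three columns of the data, written to a temporary file; the names `sample`, `path` and `data` are chosen only for this illustration.

```julia
using DelimitedFiles

# Hypothetical two-row sample in the same layout as the first columns of the course file
sample = "25 Nov 2015,28637,11314\n18 Nov 2015,28634,11314\n"

path = joinpath(mktempdir(), "sample.csv")   # temporary file, not the course data
write(path, sample)

data = readdlm(path, ',')    # file name as a string, delimiter as a character

println(size(data))          # number of rows and columns
println(eltype(data))        # element type of the whole array
println(typeof(data[1, 2]))  # an individual number still has a concrete type
```

Because the file mixes text and numbers, the array as a whole has element type Any, even though each individual entry is either a string or a number.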

<h1> Creating .csv from data tables on the web </h1>

Suppose you have found data in a table on the web. Here's how to extract the data and save it as a .csv file to be used in Julia. We start with the Ebola data that are used in this course, which can be found on Wikipedia.


<h2>          Important information: </h2>
We will use a spreadsheet, namely LibreOffice Calc, to do the work. Note that the date information on the Wikipedia site is not handled by all spreadsheets in the same way. If you use a different spreadsheet, you may find your results differ from ours.

LibreOffice Calc is available for Windows, macOS and Linux, whereas most other spreadsheets are limited to one or two of these platforms.




<h2>Wikipedia data on the West African EVD epidemic</h2>

These data are on the webpage  https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic_timeline_of_reported_cases_and_deaths

Go there now, please, and navigate until you see a table that looks like this:

<img src="Week2_Lecture2_1-Wikipedia-EVD-cases.png" alt="(Screenshot of wikipedia table of WA EVD cases)">

<h2>Using a spreadsheet to save a website table as a .csv file</h2>

Highlight the entire table, starting with the date "25 Nov 2015" all the way to the last row, which has the date "22 Mar 2014". Make sure you highlight exactly these rows, and every one of these rows completely.

Use your browser's "Copy" feature. This is different for different browsers, but in many cases you can simply right-click and then click on "Copy".

Now open a spreadsheet (we used LibreOffice Calc), and paste the table. Again, different spreadsheets give different options. Make sure you select the option that puts the column of dates in the first column.

Now save the spreadsheet, taking care to save it as a text file with .csv format. I will refer to this file as "wikipediaEVDraw.csv" but you may of course choose your own name. If you open the file with a text editor, you should see that the first few lines of the file look like this:

<img src="Week2_Lecture2_2-Wikipedia-EVD-data.png" alt="(Screenshot of first lines of .csv file)">


Notice in particular that
- Every row in the table is a new line in the file
- The first comma is AFTER the first date
- To be exact, the columns in the table are separated by commas in the file. This is what comma-separated-value files look like, no more and no less. This is the .csv format.
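As a small illustration of the format, the sketch below cuts one line at its commas. The line is a shortened, three-column stand-in for a row of the file, and the names `line` and `fields` are chosen only for this example.

```julia
# One line in the same layout as the .csv file (first three columns only)
line = "25 Nov 2015,28637,11314"

fields = split(line, ',')    # cut the line at every comma
println(fields)
println(length(fields))      # one field per column
```

Note that the date stays in one piece: it contains spaces but no commas.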

<h1> "for" loops and date-time formats: converting string values into usable time data </h1>

<h2>Outcome</h2>

After this lecture, you will be able to
- Convert a string with date data to Julia's DateTime format
- Use Dates.datetime2rata() to calculate the number of days between two DateTime dates
- Use a for loop to convert the dates of the epidemic data to days since 22 March 2014
- Use writedlm to save the converted data as a .csv file



We start by reading in the data:

In [10]:
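using DelimitedFiles   # provides readdlm and writedlm
wikiEVDraw = readdlm("wikipediaEVDraw.csv", ',')  # file name as a string, delimiter as a character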

54×9 Array{Any,2}:
 "25 Nov 2015"  28637  11314  3804  2536  …  4808     14122     3955
 "18 Nov 2015"  28634  11314  3804  2536     4808     14122     3955
 "11 Nov 2015"  28635  11314  3805  2536     4808     14122     3955
 "4 Nov 2015"   28607  11314  3810  2536     4808     14089     3955
 "25 Oct 2015"  28539  11298  3806  2535     4808     14061     3955
 "18 Oct 2015"  28476  11298  3803  2535  …  4808     14001     3955
 "11 Oct 2015"  28454  11297  3800  2534     4808     13982     3955
 "27 Sep 2015"  28388  11296  3805  2533     4808     13911     3955
 "20 Sep 2015"  28295  11295  3800  2532     4808     13823     3955
 "13 Sep 2015"  28220  11291  3792  2530     4808     13756     3953
 "6 Sep 2015"   28147  11291  3792  2530  …  4808     13683     3953
 "30 Aug 2015"  28073  11290  3792  2529     4808     13609     3953
 "16 Aug 2015"  27952  11284  3786  2524     4808     13494     3952
 ⋮                                        ⋱                     
 "9 Aug 2014"    18

<h2>Converting a date string to DateTime format</h2>

Remarkably enough, data on dates and times are among the fiddliest things a data scientist has to deal with. There is a huge number of different ways in which such data are reported, and moreover there are conflicting standards of how to deal with irregularities (month lengths aren't all the same, some years are leap years, time zones shift ...).

In consequence, every computer language that deals with date-time data has a rich array of functions to deal with it. In Julia, they are in a package called Dates. Of this package, we will use the functions DateTime() and Dates.datetime2rata().

Why does one of them use a full stop and the other not? The answer is that only some of the functions in the package Dates are visible by their bare names. These include Date() and DateTime(); datetime2rata() may not be among them, depending on your version of Julia. However, we can always reach any function in the package by putting the package name and a full stop in front of it, as in Dates.datetime2rata(). We will talk more about packages in the next lecture.

The DateTime() function uses a format string to convert string data, such as we see in column one, into Julia DateTime data.

A format string is something one sees in many computation contexts. Here, it tells Julia in what form to expect the data. Looking at the string, it is a number for the day, then a space, then an abbreviation for the month, then a space, then a number for the year. The appropriate format string is therefore "d u y". These formats have limitations: "d" accepts one- and two-digit days (which should always work) and "y" accepts two- and four-digit years (which should mostly work), but "u" accepts only three-letter abbreviations. Unfortunately, data where the month names are abbreviated in some other way are fairly common.

Here is an example of how the conversion works:



In [15]:
using Dates
DateTime(wikiEVDraw[1,1], "d u y")

2015-11-25T00:00:00
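The sketch below tries the format string on a few made-up strings (they are not taken from the data file). It assumes that full month names are matched with "U" rather than "u", and uses tryparse(), which gives back `nothing` instead of stopping with an error when a string does not fit the format.

```julia
using Dates

# Same layout as column one: day, space, three-letter month, space, year
d1 = DateTime("4 Nov 2015", "d u y")

# Full month names need "U" instead of "u"
d2 = DateTime("4 November 2015", "d U y")

# A four-letter abbreviation does not fit "u"; tryparse signals this with nothing
d3 = tryparse(DateTime, "4 Sept 2015", DateFormat("d u y"))

println(d1)
println(d2)
println(d3 === nothing)
```

If your own data contain abbreviations such as "Sept", one option is to clean the strings first; which clean-up is appropriate depends on the data.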

<h2>"for" loops</h2>

Now we need to do this conversion for every element in column 1 of the matrix. The way to do this is with a "for" loop.

"for" loops are extremely important in computing, and in Julia even more so. This is because many items that are vectorised in languages like Matlab and Python are explicitly computed in "for" loops in Julia. It may surprise many of you who know about speeding up computations using vectorisation, but it is frequently the case that a loop in Julia runs *faster* than the equivalent vectorised code.

"for" loops have a simple structure: the outside is the "for ... end" part and the inside is a code block executed repeatedly. Exactly how many times is determined by the "for ... end" part.

In the two examples below, we use println() to show the value of the variable over which the "for" loop runs. Notice that these values do not have to be a sequence of integers.



In [16]:
for num = 3:7    # here, the colon is used to specify a range; we will see this again
    println("num is now $num")
end

values = [23, "my name is not a name", 'ℵ']      # an array with some rather odd elements
for x in values    # a for loop can iterate over an array
    println("The value of x is now $x")
end

num is now 3
num is now 4
num is now 5
num is now 6
num is now 7
The value of x is now 23
The value of x is now my name is not a name
The value of x is now ℵ


It is important to get the first line of a "for" loop exactly right. It has the structure 

"variable = iterable"

Here, "iterable" is anything whose items Julia can visit one after the other. Not all types are iterable, but ranges (created with the colon operator ":") and arrays certainly are, as is any single row or column sliced from an array. The "=" assigns to "variable" the values in "iterable", one after the other; the keyword "in" may be used in its place, as in the second example above. That is, during each pass through the loop, "variable" has the value of exactly one of the items in "iterable".
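As a further sketch, the loops below run over three different kinds of iterable: a range with a step, a string, and one column of a small matrix. The array `visited` and the matrix `m` are names made up for this illustration; each loop simply records the values it passes through.

```julia
# Collect what each loop visits so the results can be inspected afterwards
visited = Any[]

for k = 1:2:7            # a range with step 2: 1, 3, 5, 7
    push!(visited, k)
end

for c in "EVD"           # a string is iterated character by character
    push!(visited, c)
end

m = [10 20; 30 40]
for x in m[:, 2]         # one column of a two-dimensional array
    push!(visited, x)
end

println(visited)
```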

<h2>Converting column 1 to DateTime type</h2>

We will use a "for" loop twice. First, we use array slicing to create a one-dimensional array containing just column one, so that its values can be converted to DateTime type.



In [29]:
col1 = wikiEVDraw[:, 1]  # the colon means all the data in the column, the 1 means the first column

54-element Array{Any,1}:
 "25 Nov 2015"
 "18 Nov 2015"
 "11 Nov 2015"
 "4 Nov 2015"
 "25 Oct 2015"
 "18 Oct 2015"
 "11 Oct 2015"
 "27 Sep 2015"
 "20 Sep 2015"
 "13 Sep 2015"
 "6 Sep 2015"
 "30 Aug 2015"
 "16 Aug 2015"
 ⋮
 "9 Aug 2014"
 "30 Jul 2014"
 "23 Jul 2014"
 "14 Jul 2014"
 "2 Jul 2014"
 "17 Jun 2014"
 "27 May 2014"
 "12 May 2014"
 "1 May 2014"
 "14 Apr 2014"
 "31 Mar 2014"
 "22 Mar 2014"

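Slicing can be tried out on a small stand-in matrix, as in the sketch below (the matrix `m` and the names `firstcol` and `secondrow` are made up for this illustration). It also shows that a slice written this way is a copy, so changing it leaves the original matrix alone.

```julia
m = Any["25 Nov 2015" 28637; "18 Nov 2015" 28634]   # small stand-in for the data matrix

firstcol = m[:, 1]       # all rows, first column
secondrow = m[2, :]      # second row, all columns

firstcol[1] = "changed"  # a slice is a copy, so the matrix itself is untouched

println(firstcol)
println(secondrow)
println(m[1, 1])
```

This is why the converted values have to be written back into the matrix explicitly if we want them there.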
In [30]:
for i = 1:54
    col1[i] = DateTime(col1[i], "d u y")  # note that this replaces the previous value in col1[i]
end

In [31]:
col1

54-element Array{Any,1}:
 2015-11-25T00:00:00
 2015-11-18T00:00:00
 2015-11-11T00:00:00
 2015-11-04T00:00:00
 2015-10-25T00:00:00
 2015-10-18T00:00:00
 2015-10-11T00:00:00
 2015-09-27T00:00:00
 2015-09-20T00:00:00
 2015-09-13T00:00:00
 2015-09-06T00:00:00
 2015-08-30T00:00:00
 2015-08-16T00:00:00
 ⋮
 2014-08-09T00:00:00
 2014-07-30T00:00:00
 2014-07-23T00:00:00
 2014-07-14T00:00:00
 2014-07-02T00:00:00
 2014-06-17T00:00:00
 2014-05-27T00:00:00
 2014-05-12T00:00:00
 2014-05-01T00:00:00
 2014-04-14T00:00:00
 2014-03-31T00:00:00
 2014-03-22T00:00:00
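Notice that col1 is still reported as an array of Any, because that is how it was created. As an alternative sketch (not the method used in this lecture), a comprehension builds a new array whose element type Julia works out from the converted values. The sample strings and the names `datestrings` and `dates` are chosen only for this illustration.

```julia
using Dates

datestrings = ["25 Nov 2015", "31 Mar 2014", "22 Mar 2014"]  # sample values from column one

# A comprehension applies the conversion to every element and builds a new array
dates = [DateTime(s, "d u y") for s in datestrings]

println(eltype(dates))   # the element type is now specific rather than Any
println(dates)
```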

<h2>Creating data giving time in days since 22 March 2014</h2>


Finally, we create the variable "epidays". This calls to mind the concept of *epidemic day*, which is simply a way to indicate how long an epidemic has been running. We will assume that the epidemic started on 22 March 2014, with a total of 49 cases, because that is the  first date for which we have data.


Note that this is in keeping with the spirit of modelling: we are trying to do the best we can with the data we have. Even when we know that the epidemic has been traced back to a single case in early December 2013, that information is not in the table of data before us. We should not forget about it, but neither should we attempt to include it in the data.

The function we use is Dates.datetime2rata(). The "Rata Die days" format is a specialised date format we will not discuss here (see https://en.wikipedia.org/wiki/Rata_Die for information). The important thing is that this function, applied to a given date, gives the number of that day when days are counted from 1 January of the year 0001, which is day 1. As follows:



In [32]:
Dates.datetime2rata(col1[1])

735927
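Since these day numbers are ordinary integers, the number of days between two dates is just a subtraction. The sketch below does this for the first and last dates in the table and cross-checks the result by subtracting two Date values; the variable names are made up for this illustration. The last line prints the day number of 1 January 0001 itself.

```julia
using Dates

first_report = DateTime(2014, 3, 22)
last_report = DateTime(2015, 11, 25)

# Rata Die day numbers are plain integers, so they can be subtracted directly
gap = Dates.datetime2rata(last_report) - Dates.datetime2rata(first_report)

# Cross-check: subtracting two Date values gives a period measured in days
check = Dates.value(Date(last_report) - Date(first_report))

println(gap)
println(check)
println(Dates.datetime2rata(DateTime(1, 1, 1)))   # day number of 1 January 0001
```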

<h2>Exporting the converted data</h2>


We create a function to express the number of days since epidemic day zero, which is the value of col1[54], which is of course 22 March 2014.

Then we iterate that function with a for loop over all the elements in col1 to create epidays. Note that the variable epidays is created before the start of the loop. This is, in general, good practice: if you know what array you want to fill, then initialise that array before you start filling it.

In [37]:
dayssincemar22(x) = Dates.datetime2rata(x) - Dates.datetime2rata(col1[54])
epidays = Array{Int64}(undef,54)
for i = 1:54
    epidays[i] = dayssincemar22(col1[i])
end
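The same steps can be checked by hand on a few dates. The sketch below uses only the last three dates of column one, so the expected day counts are easy to verify with a calendar. The helper `dayssince` and the name `epidays_sample` are hypothetical; unlike dayssincemar22() above, the helper takes day zero as a second argument rather than reading it from col1.

```julia
using Dates

datestrings = ["14 Apr 2014", "31 Mar 2014", "22 Mar 2014"]  # last three dates of column one
dates = [DateTime(s, "d u y") for s in datestrings]

# Hypothetical helper: days between a date and a chosen day zero
dayssince(x, dayzero) = Dates.datetime2rata(x) - Dates.datetime2rata(dayzero)

epidays_sample = Array{Int64}(undef, length(dates))  # initialise before filling
for i in eachindex(dates)
    epidays_sample[i] = dayssince(dates[i], dates[end])
end

println(epidays_sample)
```

Using eachindex() and `end` instead of the literal 54 means the code keeps working if the number of rows changes.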

Finally, we overwrite column 1 of our data array with epidays, and save it using writedlm(). It is a good idea to use a new filename, so that all the work that went into extracting the data from Wikipedia is not lost. You never know when you might want the original dates again!

In [38]:
wikiEVDraw[:, 1] = epidays
writedlm("wikipediaEVDdatesconverted.csv", wikiEVDraw, ',')  
#         note the delimiter ... the Julia default is a tab; to get .csv, we must specify the comma
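To see what writedlm() actually puts in the file, here is a small round-trip sketch. It uses a tiny matrix of illustrative numbers (not the course data) and a temporary file; the names `converted`, `path`, `text` and `back` are made up for this example.

```julia
using DelimitedFiles

converted = [23 203; 9 112; 0 49]   # illustrative numbers, only to show the round trip

path = joinpath(mktempdir(), "roundtrip.csv")   # temporary file name, chosen for illustration
writedlm(path, converted, ',')      # comma as delimiter, otherwise a tab is used

text = read(path, String)           # the raw file contents
back = readdlm(path, ',', Int)      # read it again, asking for integers

println(text)
println(back == converted)
```

Passing a type such as Int as the third argument of readdlm() is optional; it asks for an array of that type instead of letting Julia choose.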

# Practice Quiz


<h2 id="int"> Question 1      </h2>

Which line of Julia code is used to load data from a csv file into a variable?

A csv file refers to a simple file format used for storing tabular data, such as a spreadsheet or data sets. The letters CSV stand for "comma-separated values". For example, a csv file with 2 rows and 3 columns of numbers might appear something like:

153,455,2322

3,96,-76

a. readdlm("data.csv", ',')

b. data= readdlm("data.csv", ',')

c. data= readdlm("data.csv", '#')

Double-click __here__ for the solution.

<!-- Your answer is below:
data= readdlm("data.csv", ',')
Answer: B 
--> 


<h2 id="int"> Question 2      </h2>

Which line of Julia code would return the correct date and time given the string "2015 Nov 25"?

a. using Dates

Dates.DateTime("2015 Nov 25", "y u d")

b. using Dates

Dates.DateTime("2015 Nov 25", "u y d")

c. using Dates

Dates.DateTime("2015 Nov 25", "d u y")

Double-click __here__ for the solution.

<!-- Your answer is below:
using Dates
Dates.DateTime("2015 Nov 25", "y u d")
Answer: A 
--> 


<h2 id="int"> Question 3     </h2>

Consider the following Julia code fragment. The “for” loop is repeated for each value i in the array mylist. Then mylist[i] is set to the current value of count.

mylist = [3, 2, 1]

count=1

for i in mylist

  mylist[i]=count
  
  count=count+1
  
end

What is the value of mylist[3] at the end of this loop?

Double-click __here__ for the solution.

<!-- Your answer is below:
mylist = [3, 2, 1]

count=1

for i in mylist

  mylist[i]=count
  
  count=count+1
  
end

mylist[3]
1
Answer: 1
--> 


<h2 id="int"> Question 4      </h2>

Consider the following Julia code fragment.

count=0

for i=1:3

  for j=1:3
  
    count=count+1
    
  end
  
end

What is the value of count after both loops have finished?

Double-click __here__ for the solution.

<!-- Your answer is below:
count=0

for i=1:3

  for j=1:3
  
    count=count+1
    
  end
  
end
count
9
Answer: 9
--> 


<h2 id="int"> Question 5      </h2>

Which line of code would return the seventh row in a two dimensional array variable called data?

a. data[7, :]

b. data[6, :]

c. data[:, 7]

d. data[7,]

Double-click __here__ for the solution.

<!-- Your answer is below:
data[7,:]
Answer: A 
--> 


<h2 id="int"> Question 6     </h2>

What value is returned by running the following code in Julia?

summedvals = 3

for k = 1:2:5 

  summedvals = summedvals + k
  
end

summedvals


Double-click __here__ for the solution.

<!-- Your answer is below:
summedvals = 3
for k = 1:2:5
    summedvals = summedvals + k
end
summedvals
12
Answer: 12 
--> 