# I/O, Networking, and Parallel Computing

In this chapter, we will explore how Julia interacts with the outside world, reading from
standard input and writing to standard output, files, networks, and databases. Julia
provides asynchronous networking I/O using the libuv library. We will see how to handle
data in Julia. We will also explore Julia's parallel processing model.

In this chapter, the following topics are covered:
- Basic input and output
- Working with files (including CSV files)
- Using DataFrames
- Working with TCP sockets and servers
- Interacting with databases
- Parallel operations and computing

# Basic input and output

Julia's view of input/output (I/O) is stream-oriented, that is, it reads or writes streams of
bytes. We will introduce several kinds of stream in this chapter, such as file streams. Standard
input (stdin) and standard output (stdout) are constants of the TTY type (an abbreviation
for the old term, Teletype) that can be read from and written to in Julia code (refer to the
code in Chapter 8\io.jl):

- read(stdin, Char): This command waits for a character to be entered, and
then returns that character; for example, when you type in J, this returns 'J'

In [34]:
a = read(stdin, String)

""

- write(stdout, "Julia"): This command types out Julia5 (the added 5 is
the number of bytes in the output stream; it is not added if the command ends in
a semicolon ;)

In [25]:
write(stdout, "WTF");

WTF

In [26]:
a = read(stdin, String)

""

stdin and stdout are simply streams and can be replaced by any stream object in
read/write commands. read(stream, n) reads up to n bytes from a stream into a
vector (it replaces the readbytes function of early Julia versions):

In [28]:
a = @async read(stdin, 10)

Task (done) @0x00007f0479d74b90

- readline(stdin): This command reads all the input until a newline
character, \n, is entered. For example, type Julia and press Enter; this returns
"Julia". The line ending is stripped by default; pass keep=true to retain it.

In [38]:
readline()

stdin>  What the heck is going on here 


"What the heck is going on here "

If you need to read all the lines from an input stream, use the eachline method in a for
loop, for example:

In [42]:
stream = stdin
for lines in eachline(stream)
    print(lines)
end


To test whether you have reached the end of an input stream, use eof(stream) in
combination with a while loop, as follows:

In [44]:
# while !eof(stream)
# x = read(stream, Char)
# println("Found: $x")
# # process the character
# end
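
Because any stream can stand in for stdin and stdout, the calls above can be tried without keyboard input. The following is a minimal sketch that uses an in-memory IOBuffer as the stream; the sample text and variable names are only illustrative.

```julia
# IOBuffer is an in-memory stream, so nothing here waits for the keyboard.
out = IOBuffer()
nbytes = write(out, "Julia")          # write returns the number of bytes written
written = String(take!(out))          # collect what was written to the buffer

io = IOBuffer("Julia\nis fast\n")
first_char = read(io, Char)           # reads one character
rest_of_line = readline(io)           # rest of the first line, newline stripped
raw = read(io, 2)                     # next two bytes as a Vector{UInt8}

remaining = String[]
while !eof(io)                        # loop until the stream is exhausted
    push!(remaining, readline(io; keep=true))  # keep=true retains the "\n"
end
```

Swapping `io` for `stdin` (or for a file stream) leaves the rest of the code unchanged, which is the point of the stream abstraction.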

# Working with files

To work with files, we need the IOStream type. IOStream is a type with the IO supertype
and has the following characteristics:

In [52]:
fieldnames(IOStream)

(:handle, :ios, :name, :mark, :lock, :_dolock)

In [51]:
IOStream.types

svec(Ptr{Nothing}, Vector{UInt8}, String, Int64, ReentrantLock, Bool)

The file handle is a pointer of the Ptr type, which is a reference to the file object.

In [53]:
file = open("/home/javid/quotes/Quotes")

IOStream(<file /home/javid/quotes/Quotes>)

In [56]:
typeof(file)

IOStream

In [57]:
data = readlines(file)

1275-element Vector{String}:
 "why don't you grovel at my feet (grovel: crawel,beg)"
 "why must i stoop so low? (stoop: act of bending the body forward and backward)"
 "you make it sound you and I are an Item"
 "ever since i reached puberty ... (adulthoot)"
 "it's as though we added fuel to fire"
 "did you think you could elude me? (elude: evade or escape from)"
 "i will fleece you( fleece :the wool coat of a sheep)"
 "beat your head to a pulp (pulp: soft, wet shapeless)"
 "soil his fur with blood (soil: make dirty)"
 "your locks are just like father's (locks:  a piece of a person's hair that coils or hangs together)"
 "stop quibbling over small matters ( quibble: argue about a trivial matter. )"
 "my hucnch was right(hunch:  a feeling or guess based on intuition. )"
 "they've come to bid a final farewell (bid: offer (a certain price) for something, especially at an auction.)"
 "why would you get so sappy on me(sappy: over semtimental)"
 "i've been missing all my cram classess for high

In [58]:
length(data)

1275

In [63]:
for l in data[1:10]
    println(l)
end

why don't you grovel at my feet (grovel: crawel,beg)
why must i stoop so low? (stoop: act of bending the body forward and backward)
you make it sound you and I are an Item
ever since i reached puberty ... (adulthoot)
it's as though we added fuel to fire
did you think you could elude me? (elude: evade or escape from)
i will fleece you( fleece :the wool coat of a sheep)
beat your head to a pulp (pulp: soft, wet shapeless)
soil his fur with blood (soil: make dirty)
your locks are just like father's (locks:  a piece of a person's hair that coils or hangs together)


Always close the IOStream object with close(file) to clean up and free resources. If you want
to read the file into one string, use read(file, String), which replaces the older readall
function (for example, see the word_frequency program in Chapter 5, Collection Types). Use
this only for relatively small files because of the memory consumption; this can also be a
potential problem when using readlines.

There is a convenient shorthand with the do syntax for opening a file, applying a function
process, and closing it automatically. This goes as follows (file is the IOStream object in
this code):

In [64]:
file_path = "/home/javid/quotes/Quotes"

"/home/javid/quotes/Quotes"

In [65]:
open(file_path) do file
    data = readlines(file)
end


1275-element Vector{String}:
 "why don't you grovel at my feet (grovel: crawel,beg)"
 "why must i stoop so low? (stoop: act of bending the body forward and backward)"
 "you make it sound you and I are an Item"
 "ever since i reached puberty ... (adulthoot)"
 "it's as though we added fuel to fire"
 "did you think you could elude me? (elude: evade or escape from)"
 "i will fleece you( fleece :the wool coat of a sheep)"
 "beat your head to a pulp (pulp: soft, wet shapeless)"
 "soil his fur with blood (soil: make dirty)"
 "your locks are just like father's (locks:  a piece of a person's hair that coils or hangs together)"
 "stop quibbling over small matters ( quibble: argue about a trivial matter. )"
 "my hucnch was right(hunch:  a feeling or guess based on intuition. )"
 "they've come to bid a final farewell (bid: offer (a certain price) for something, especially at an auction.)"
 "why would you get so sappy on me(sappy: over semtimental)"
 "i've been missing all my cram classess for high

In [70]:
open(file_path) do file
    for (i, line) in enumerate(eachline(file))
        println(line) # or process line
        if i > 10 break end
    end
end

why don't you grovel at my feet (grovel: crawel,beg)
why must i stoop so low? (stoop: act of bending the body forward and backward)
you make it sound you and I are an Item
ever since i reached puberty ... (adulthoot)
it's as though we added fuel to fire
did you think you could elude me? (elude: evade or escape from)
i will fleece you( fleece :the wool coat of a sheep)
beat your head to a pulp (pulp: soft, wet shapeless)
soil his fur with blood (soil: make dirty)
your locks are just like father's (locks:  a piece of a person's hair that coils or hangs together)
stop quibbling over small matters ( quibble: argue about a trivial matter. )
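
As a self-contained variant of the file operations above, here is a small sketch that writes a scratch file and reads it back in three ways. The file name comes from tempname() and the contents are made up for illustration.

```julia
path = tempname()                      # scratch file path (illustrative)

open(path, "w") do file                # "w" creates or truncates the file
    println(file, "first line")
    println(file, "second line")
end                                    # the file is closed here, even on error

whole = read(path, String)             # entire file as one String
lines = readlines(path)                # Vector{String}, line endings stripped

# eachline streams the file lazily, so only one line is held at a time.
nchars = open(path) do file
    sum(length(line) for line in eachline(file))
end

rm(path)                               # clean up the scratch file
```

The do-block form is usually preferable to a bare open/close pair because the stream is closed even when the body throws.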


# Reading and writing CSV files

A CSV file is a comma-separated file. The data fields in each line are separated by commas
(,) or another delimiter, such as semicolons (;). These files are the de facto standard for
exchanging small and medium amounts of tabular data.

Such files are structured so that one line contains data about one data object, so we need
a way to read and process the file line by line.

In general, the readdlm function from the DelimitedFiles package is used to read in the
data from the CSV files:

In [1]:
using DelimitedFiles

In [9]:
file_path = "../../Julia-1-Programming-Complete-Reference-Guide-master/Chapter08/winequality.csv"

"../../Julia-1-Programming-Complete-Reference-Guide-master/Chapter08/winequality.csv"

In [10]:
data = DelimitedFiles.readdlm(file_path, ';')

1600×12 Matrix{Any}:
   "fixed acidity;\"volatile acidity\";\"citric acid\";\"residua" ⋯ 58 bytes ⋯ "ioxide\";\"density\";\"pH\";\"sulphates\";\"alcohol\";\"quality\""  …   ""   ""      ""     ""   ""       ""    ""     ""   ""
  7.4                                                                                                                                                                                                  1.9  0.076  11     34    0.9978   3.51  0.56   9.4  5
  7.8                                                                                                                                                                                                  2.6  0.098  25     67    0.9968   3.2   0.68   9.8  5
  7.8                                                                                                                                                                                                  2.3  0.092  15     54    0.997    3.26  0.65   9.8  5
 11.2        

In [11]:
size(data)

(1600, 12)

In [12]:
length(data)

19200

The problem with what we have done so far is that the header (the column titles) was read
as part of the data. Fortunately, we can pass the header=true argument to let Julia put the
first line in a separate array.

In [14]:
data, header = DelimitedFiles.readdlm(file_path, ';', header=true)

([7.4 0.7 … 9.4 5.0; 7.8 0.88 … 9.8 5.0; … ; 5.9 0.645 … 10.2 5.0; 6.0 0.31 … 11.0 6.0], AbstractString["fixed acidity;\"volatile acidity\";\"citric acid\";\"residual sugar\";\"chlorides\";\"free sulfur dioxide\";\"total sulfur dioxide\";\"density\";\"pH\";\"sulphates\";\"alcohol\";\"quality\"" "" … "" ""])

In [20]:
header[1]

"fixed acidity;\"volatile acidity\";\"citric acid\";\"residual sugar\";\"chlorides\";\"free sulfur dioxide\";\"total sulfur dioxide\";\"density\";\"pH\";\"sulphates\";\"alcohol\";\"quality\""

It then naturally gets the correct datatype, Float64, for the
data array. We can also specify the type explicitly, such as this:

In [24]:
methods(DelimitedFiles.readdlm)

In [28]:
data,header = DelimitedFiles.readdlm(file_path, ';',Float64, '\n' , header=true)

([7.4 0.7 … 9.4 5.0; 7.8 0.88 … 9.8 5.0; … ; 5.9 0.645 … 10.2 5.0; 6.0 0.31 … 11.0 6.0], AbstractString["fixed acidity;\"volatile acidity\";\"citric acid\";\"residual sugar\";\"chlorides\";\"free sulfur dioxide\";\"total sulfur dioxide\";\"density\";\"pH\";\"sulphates\";\"alcohol\";\"quality\"" "" … "" ""])
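
The header handling can also be tried on a small file that we create ourselves. This is a sketch under assumptions: the column names and values are invented, and the file lives at a tempname() path rather than in the chapter's data directory.

```julia
using DelimitedFiles

path = tempname()                          # scratch file (illustrative)
open(path, "w") do io
    println(io, "acidity;sugar;quality")   # header line
    writedlm(io, [7.4 1.9 5; 7.8 2.6 5], ';')
end

# With header=true, readdlm returns a (data, header) tuple.
cells, header = readdlm(path, ';', Float64; header=true)

rm(path)
```

Here `cells` is a 2×3 Matrix{Float64} and `header` is a 1×3 matrix of strings, so `vec(header)` gives the column names as a vector.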

In [34]:
? DelimitedFiles.readdlm

```
readdlm(source, T::Type; options...)
```

The columns are assumed to be separated by one or more whitespaces. The end of line delimiter is taken as `\n`.

# Examples

```jldoctest
julia> using DelimitedFiles

julia> x = [1; 2; 3; 4];

julia> y = [5; 6; 7; 8];

julia> open("delim_file.txt", "w") do io
           writedlm(io, [x y])
       end;

julia> readdlm("delim_file.txt", Int64)
4×2 Matrix{Int64}:
 1  5
 2  6
 3  7
 4  8

julia> readdlm("delim_file.txt", Float64)
4×2 Matrix{Float64}:
 1.0  5.0
 2.0  6.0
 3.0  7.0
 4.0  8.0

julia> rm("delim_file.txt")
```

---

```
readdlm(source, delim::AbstractChar, T::Type; options...)
```

The end of line delimiter is taken as `\n`.

# Examples

```jldoctest
julia> using DelimitedFiles

julia> x = [1; 2; 3; 4];

julia> y = [1.1; 2.2; 3.3; 4.4];

julia> open("delim_file.txt", "w") do io
           writedlm(io, [x y], ',')
       end;

julia> readdlm("delim_file.txt", ',', Float64)
4×2 Matrix{Float64}:
 1.0  1.1
 2.0  2.2
 3.0  3.3
 4.0  4.4

julia> rm("delim_file.txt")
```

---

```
readdlm(source; options...)
```

The columns are assumed to be separated by one or more whitespaces. The end of line delimiter is taken as `\n`. If all data is numeric, the result will be a numeric array. If some elements cannot be parsed as numbers, a heterogeneous array of numbers and strings is returned.

# Examples

```jldoctest
julia> using DelimitedFiles

julia> x = [1; 2; 3; 4];

julia> y = ["a"; "b"; "c"; "d"];

julia> open("delim_file.txt", "w") do io
           writedlm(io, [x y])
       end;

julia> readdlm("delim_file.txt")
4×2 Matrix{Any}:
 1  "a"
 2  "b"
 3  "c"
 4  "d"

julia> rm("delim_file.txt")
```

---

```
readdlm(source, delim::AbstractChar; options...)
```

The end of line delimiter is taken as `\n`. If all data is numeric, the result will be a numeric array. If some elements cannot be parsed as numbers, a heterogeneous array of numbers and strings is returned.

# Examples

```jldoctest
julia> using DelimitedFiles

julia> x = [1; 2; 3; 4];

julia> y = [1.1; 2.2; 3.3; 4.4];

julia> open("delim_file.txt", "w") do io
           writedlm(io, [x y], ',')
       end;

julia> readdlm("delim_file.txt", ',')
4×2 Matrix{Float64}:
 1.0  1.1
 2.0  2.2
 3.0  3.3
 4.0  4.4

julia> z = ["a"; "b"; "c"; "d"];

julia> open("delim_file.txt", "w") do io
           writedlm(io, [x z], ',')
       end;

julia> readdlm("delim_file.txt", ',')
4×2 Matrix{Any}:
 1  "a"
 2  "b"
 3  "c"
 4  "d"

julia> rm("delim_file.txt")
```

---

```
readdlm(source, delim::AbstractChar, eol::AbstractChar; options...)
```

If all data is numeric, the result will be a numeric array. If some elements cannot be parsed as numbers, a heterogeneous array of numbers and strings is returned.

---

```
readdlm(source, delim::AbstractChar, T::Type, eol::AbstractChar; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=false, comment_char='#')
```

Read a matrix from the source where each line (separated by `eol`) gives one row, with elements separated by the given delimiter. The source can be a text file, stream or byte array. Memory mapped files can be used by passing the byte array representation of the mapped segment as source.

If `T` is a numeric type, the result is an array of that type, with any non-numeric elements as `NaN` for floating-point types, or zero. Other useful values of `T` include `String`, `AbstractString`, and `Any`.

If `header` is `true`, the first row of data will be read as header and the tuple `(data_cells, header_cells)` is returned instead of only `data_cells`.

Specifying `skipstart` will ignore the corresponding number of initial lines from the input.

If `skipblanks` is `true`, blank lines in the input will be ignored.

If `use_mmap` is `true`, the file specified by `source` is memory mapped for potential speedups if the file is large. Default is `false`. On a Windows filesystem, `use_mmap` should not be set to `true` unless the file is only read once and is also not written to. Some edge cases exist where an OS is Unix-like but the filesystem is Windows-like.

If `quotes` is `true`, columns enclosed within double-quote (") characters are allowed to contain new lines and column delimiters. Double-quote characters within a quoted field must be escaped with another double-quote.  Specifying `dims` as a tuple of the expected rows and columns (including header, if any) may speed up reading of large files.  If `comments` is `true`, lines beginning with `comment_char` and text following `comment_char` in any line are ignored.

# Examples

```jldoctest
julia> using DelimitedFiles

julia> x = [1; 2; 3; 4];

julia> y = [5; 6; 7; 8];

julia> open("delim_file.txt", "w") do io
           writedlm(io, [x y])
       end

julia> readdlm("delim_file.txt", '\t', Int, '\n')
4×2 Matrix{Int64}:
 1  5
 2  6
 3  7
 4  8

julia> rm("delim_file.txt")
```


Let's continue working with the data variable. It holds a matrix, and we can get its
rows and columns using the normal array-matrix syntax (refer to the Matrices
section in Chapter 5, Collection Types).

In [35]:
data[2,:]

12-element Vector{Float64}:
  7.8
  0.88
  0.0
  2.6
  0.098
 25.0
 67.0
  0.9968
  3.2
  0.68
  9.8
  5.0

In [49]:
header[1]

"fixed acidity;\"volatile acidity\";\"citric acid\";\"residual sugar\";\"chlorides\";\"free sulfur dioxide\";\"total sulfur dioxide\";\"density\";\"pH\";\"sulphates\";\"alcohol\";\"quality\""

In [55]:
split(header[1],";")

12-element Vector{SubString{String}}:
 "fixed acidity"
 "\"volatile acidity\""
 "\"citric acid\""
 "\"residual sugar\""
 "\"chlorides\""
 "\"free sulfur dioxide\""
 "\"total sulfur dioxide\""
 "\"density\""
 "\"pH\""
 "\"sulphates\""
 "\"alcohol\""
 "\"quality\""

To get a matrix with the data from columns 3, 6, and 11, execute the following command:

In [56]:
z = [data[:,3] data[:,6] data[:,11]]

1599×3 Matrix{Float64}:
 0.0   11.0   9.4
 0.0   25.0   9.8
 0.04  15.0   9.8
 0.56  17.0   9.8
 0.0   11.0   9.4
 0.0   13.0   9.4
 0.06  15.0   9.4
 0.0   15.0  10.0
 0.02   9.0   9.5
 0.36  17.0  10.5
 0.08  15.0   9.2
 0.36  17.0  10.5
 0.0   16.0   9.9
 0.29   9.0   9.1
 0.18  52.0   9.2
 ⋮           
 0.44  24.0  11.6
 0.44  22.0  11.5
 0.41  34.0  11.4
 0.11  18.0  10.9
 0.33  34.0  12.8
 0.2   29.0   9.2
 0.15  26.0  11.6
 0.09  16.0  11.6
 0.13  29.0  11.0
 0.08  28.0   9.5
 0.08  32.0  10.5
 0.1   39.0  11.2
 0.13  29.0  11.0
 0.12  32.0  10.2
 0.47  18.0  11.0
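
The same column selection can be written with a vector of column indices, which avoids the explicit concatenation. The following sketch uses a small stand-in matrix (not the wine data) to show that both forms give the same result.

```julia
# Stand-in 3×12 matrix filled column by column with 1.0, 2.0, ..., 36.0.
m = reshape(collect(1.0:36.0), 3, 12)

z1 = [m[:, 3] m[:, 6] m[:, 11]]   # concatenate three column slices
z2 = m[:, [3, 6, 11]]             # index with a vector of column numbers
```

Both expressions copy the selected columns into a new matrix; `view(m, :, [3, 6, 11])` would give a non-copying view instead.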

To write to a CSV file, use the writedlm function from DelimitedFiles and pass the separator
as the last argument (the older writecsv function is no longer available in Julia 1.x). For
example, to write a random matrix to a RandomMatrix.csv file with semicolons as separators,
execute the following commands:

In [57]:
data = rand(1:100,(100,100))

100×100 Matrix{Int64}:
 20  35  39   50  11   8   5  99  72  40  74   5  57  44  52  …   9  68  24   87  27   51   42  81   55  71   6  73  93  19
  3  18  97    6  90  45  29  56  86  23  97  53  85  96  83     16  11  28   66  42   53   37  77   34   7  32  22  89  50
 13  79  22   23  44  78  43  72  43   1  71  54  74  53  15     38  24  39   68  51   46   86  22   61  19  47  90  74  40
 23  55  37   92   8  61  21  39  58  28  42  82  70  54  27      1  85  14   88  66   76   35   9   60   5  22  18  73  30
 45  84  94   22  77  18  35  27  54  42  33  24  91  46  27      8  82  18   25  85   12   10  72   43  26  61  12  51  13
 58  54  78   38  83  57  76  57  55  39  75   1  11  76  28  …  27   4  68   98  98   63   10  38   10  24  67  30   5  48
 36  85  28   72  70  83  94  41  30  50  67  53  96  61  75     51  20  71   66  49   19   88  76   25  19  62  58  15  14
 78   8  39   93  75  78  86  85  80  48  53  91  74  50   6     69   7  70   85  82   75   70  93   90  67  

In [58]:
size(data)

(100, 100)

In [60]:
DelimitedFiles.writedlm("RandomMatrix.csv",data,';')

If more control is necessary, you can easily combine the more basic functions from the
previous section. For example, the following cells build the file contents by hand: a header
line followed by one semicolon-joined line per matrix row, all written to a file:

In [68]:
a = []

Any[]

In [73]:
push!(a,[1,2,3]...)

4-element Vector{Any}:
  [1, 2, 3]
 1
 2
 3

In [74]:
a

4-element Vector{Any}:
  [1, 2, 3]
 1
 2
 3

In [66]:
;rm RandomMatrix.csv

In [126]:
open("RandomMatrix.csv", "w") do file  # the mode must be a String such as "w", not a Char
    headers = collect(1:100)
    data = rand(1:1000, (100, 100))
    data_str = [join(headers, ";")]  # first line: the column names
    push!(data_str, [join(row, ";") for row in eachrow(data)]...)  # one line per matrix row
    write(file, join(data_str, '\n'))
end

# Using DataFrames

A DataFrame is the most natural representation for working with such an (m x n) table of data.
DataFrames are similar to pandas DataFrames in Python or data.frame in R. A DataFrame is a
more specialized tool than a normal array for working with tabular and statistical data, and
it is defined in the DataFrames package, a popular Julia library for statistical work.

A common case in statistical data is that data values can be missing (the information is not
known). Julia provides a unique value, missing, which represents a non-existing value and
has the Missing type; since Julia 1.0 it is part of Base, and the Missings package adds
further utilities for working with it. The result of computations that contain missing
values mostly cannot be determined; for example, 42 + missing returns missing.
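
The propagation of missing, and the usual ways around it, can be sketched with Base functions only; the sample vector is made up for illustration.

```julia
values = [1, missing, 3]                  # a Vector{Union{Missing, Int64}}

propagated = 42 + missing                 # arithmetic with missing gives missing
total = sum(values)                       # also missing, since one element is unknown

clean_total = sum(skipmissing(values))    # skip the missing entries
filled = coalesce.(values, 0)             # replace missing with a default value
has_gap = any(ismissing, values)          # test for missing entries
```

Note that `missing == missing` itself evaluates to `missing`, which is why `ismissing` is used for the test.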

A DataFrame is a kind of in-memory database, versatile in the various ways you can work
with data. It consists of columns with names such as Col1, Col2, and Col3. All of these
columns are vectors that have their own element type, and the data they contain can be
referred to by the column names as well, so we have substantially more forms of indexing.

Unlike two-dimensional arrays, columns in DataFrame can be of different types. One
column might, for instance, contain the names of students and should therefore be a string.
Another column could contain their age and should be an integer.

In [2]:
using DataFrames, Missings

In [9]:
df = DataFrame()

In [10]:
df[!, :Col1] = 1:4
df[!, :Col2] = [exp(1), pi, sqrt(2), 42]
df[!, :Col3] = [true, false, true, false]
show(df)

4×3 DataFrame
 Row │ Col1   Col2      Col3
     │ Int64  Float64   Bool
─────┼────────────────────────
   1 │     1   2.71828   true
   2 │     2   3.14159  false
   3 │     3   1.41421   true
   4 │     4  42.0      false

In [12]:
df

4×3 DataFrame
 Row │ Col1   Col2      Col3
     │ Int64  Float64   Bool
─────┼────────────────────────
   1 │     1   2.71828   true
   2 │     2   3.14159  false
   3 │     3   1.41421   true
   4 │     4  42.0      false


show(df) produces a nicely formatted table (whereas show(df[!, :Col2]) prints the column as
a plain vector). This is because the package defines a show() method for the entire
contents of a DataFrame.

We could also have used the full constructor, as follows:

In [15]:
df = DataFrame(col1=1:10, col2=11:20)

10×2 DataFrame
 Row │ col1   col2
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
   7 │     7     17
   8 │     8     18
   9 │     9     19
  10 │    10     20


You can refer to columns either by an index (the column number) or by a name; for example,
df[!, 2] and df[!, :col2] return the same column. Rows and columns can also be selected
together:

In [20]:
show(df[!, 2])

[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [23]:
df[!, 1:2]

10×2 DataFrame
 Row │ col1   col2
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
   7 │     7     17
   8 │     8     18
   9 │     9     19
  10 │    10     20


In [24]:
df[2:4, 1:2]

3×2 DataFrame
 Row │ col1   col2
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     3     13
   3 │     4     14
3,4,14


In [25]:
df[2:9, :col2]

8-element Vector{Int64}:
 12
 13
 14
 15
 16
 17
 18
 19

In [27]:
df[3:9, [:col1, :col2]]

7×2 DataFrame
 Row │ col1   col2
     │ Int64  Int64
─────┼──────────────
   1 │     3     13
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │     9     19
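
The difference between the `!` and `:` row selectors is easy to miss in the cells above, so here is a short sketch of it. It assumes the DataFrames package is installed; the `people` table and its values are invented for illustration.

```julia
using DataFrames   # third-party package, assumed to be installed

# Columns of different element types in one table (illustrative data).
people = DataFrame(name = ["Ana", "Ben", "Cy"], age = [31, 27, 45])

by_index = people[!, 2]        # the stored column itself, no copy
by_name  = people[!, :age]     # the same vector, addressed by name
copied   = people[:, :age]     # `:` as the row selector returns a copy

copied[1] = 0                  # does not touch the data frame
by_name[2] = 28                # changes people.age in place
```

Using `!` is cheaper but exposes the underlying storage, so `:` is the safer choice when the result will be modified independently.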


### The following functions are very useful when working with DataFrames:

- The names function gives the names of the columns: names(df).

In [35]:
names(df)

2-element Vector{String}:
 "col1"
 "col2"

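- The eltypes function of older DataFrames versions has been removed. To get the data types
of the columns, broadcast eltype over eachcol(df); calling eltype(df) on the whole
DataFrame returns Any instead.

In [36]:
eltype.(eachcol(df))

2-element Vector{DataType}:
 Int64
 Int64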

- The describe function tries to give some useful summary information about the
data in the columns, depending on the type.

In [37]:
describe(df)

2×7 DataFrame
 Row │ variable  mean     min    median   max    nmissing  eltype
     │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType
─────┼──────────────────────────────────────────────────────────────
   1 │ col1          5.5      1      5.5     10         0  Int64
   2 │ col2         15.5     11     15.5     20         0  Int64


To load data from a local CSV file, use the read function from the CSV package (the
following are the docs for that package: https://juliadata.github.io/CSV.jl/stable/).
Passing DataFrame as the second argument makes the returned object a DataFrame:

In [38]:
using CSV

In [43]:
file_name = "RandomMatrix.csv"

"RandomMatrix.csv"

In [45]:
a = CSV.read(file_name, DataFrame, delim=';')

 Row │ 1      2      3      4      5      6      7      8      9      10     11     12
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │   530    827    889    260    142     97    529    140    666     36    761    518
   2 │   990     22    987    824     91    969     42    279    267    718      1    925
   3 │   871    510    826    763    938    591    448    322     78    586    233      1
   4 │   210    872    756    572    944    124    823    545    642    431    576    611
   5 │   693    364    515    893     98    369    882    734    611    247    828    623
   6 │   342    835    731    474    608     86    329    108    710    743    854    259
   7 │    79    600    213    330    636    628    157    922     47    597    273    833
   8 │   672    815    440    661    474     13    947    127     33    957    449    847
   9 │   790    492    828    914    661    177    626    354    456    568    722     66
  10 │   228    482    671    745    974    537     61    969     30    990    432    285


In [47]:
size(a)

(100, 100)

Writing a DataFrame to a file can be done with the CSV.write function, which takes the
filename and the DataFrame as arguments, for example,
CSV.write("dataframe1.csv", df, delim=';').

In [52]:
CSV.write("RandomMatrix2.csv", a, delim=';')

"RandomMatrix2.csv"

In [55]:
a[!, :quality] = 1:100

1:100

# Other file formats

- For JSON, use the JSON package. The parse method converts JSON strings into
dictionaries, and the json method turns any Julia object into a JSON string.
- For XML, use the LightXML package.
- For YAML, use the YAML package.
- For HDF5 (a common format for scientific data), use the HDF5 package.
- For working with Windows INI files, use the IniFile package.

# Working with TCP sockets and servers

To send data over a network, the data has to conform to a certain format or protocol. The
Transmission Control Protocol / Internet Protocol (TCP/IP) is one of the core protocols to
be used on the internet.

In [96]:
using Sockets

In [97]:
server = Sockets.listen(8080)

Sockets.TCPServer(RawFD(48) active)

In [98]:
conn = accept(server)

TCPSocket(RawFD(49) open, 0 bytes waiting)

In [99]:
line = readline(conn)

""

In [103]:
close(conn)

In [None]:
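using Sockets
server = Sockets.listen(8085)
while true
    conn = accept(server)
    @async begin
        try
            while !eof(conn) # stop when the client disconnects
                line = readline(conn)
                println(line) # output in server console
                write(conn, line * "\n") # echo the line back, restoring the newline
            end
        catch ex
            print("connection ended with error $ex")
        finally
            close(conn) # always release the socket
        end
    end # end coroutine block
end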


To serve more than one client, we place the accept() function within an infinite while loop,
so that each incoming connection is accepted. The same is true for reading and writing to a
specific client; the server only stops listening to that client when the client disconnects.
Because the network communication with the clients is a possible source of errors, we have
to surround it with a try/catch expression.

However, we also see an @async macro here; what is its function? The @async macro starts
a new coroutine (refer to the Tasks section in Chapter 4, Control Flow) in the local process to
handle the execution of the begin...end block that starts right after it. So, the @async
macro handles the connection with each particular client in a separate coroutine.

On the other hand, the @sync macro is used to enclose a number of @async (or @spawnat or
@distributed) calls (refer to the Parallel operations and computing section), and code
execution waits at the end of the @sync block until all the enclosed calls are finished.
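
The interplay of the two macros can be sketched without any networking. In this illustrative example, each task sleeps briefly to stand in for an I/O wait; the array name and the sleep duration are arbitrary.

```julia
results = zeros(Int, 3)

# @sync waits here until every @async task started inside the loop has finished.
@sync for i in 1:3
    @async begin
        sleep(0.2)            # stands in for waiting on I/O
        results[i] = i^2      # each task fills its own slot
    end
end

# All three tasks are done at this point.
total = sum(results)
```

Because the tasks yield while sleeping, the three waits overlap instead of running one after another.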

In [1]:
using Sockets

In [None]:
server = Sockets.listen(9092)
while true
    connection = accept(server)
    @async begin
        try
            while !eof(connection) # leave the loop when the client disconnects
                message = readline(connection)
                println(message)
                if message == "Die"
                    write(connection, "Bye Bye :( I'm gonna Die\n")
                    break
                end
            end
        catch ex
            print(ex)
        finally
            close(connection) # close the socket on exit or on error
        end
    end
end

    

The listen function has some variants, for example, listen(IPv6(0),2001) creates a
TCP server that listens on port 2001 on all IPv6 interfaces. Similarly, instead of readline,
there are also simpler read methods:

- read(conn, UInt8): This method blocks until there is a byte to read from
conn, and then returns it. Use convert(Char, n) to convert a UInt8 value into
Char. This will let you see the ASCII letter for the UInt8 you read in.
- read(conn, Char): This method blocks until there is a character to read from conn,
and then returns it as a Char.

The important aspect about the communication API is that the code looks like synchronous
code executing line by line, even though the I/O is actually happening asynchronously
through the use of tasks. We don't have to worry about writing callbacks as in some other
languages.
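
To see this synchronous-looking style from both ends, here is a minimal sketch that runs a one-shot server and its client in the same process. It assumes port 8000 or a nearby port is free (listenany picks the first available one), and the uppercase reply is only an illustrative choice.

```julia
using Sockets

# listenany returns the port it actually bound; 8000 is just a starting hint.
port, server = listenany(ip"127.0.0.1", 8000)

# Server side: accept one client and answer a single line.
server_task = @async begin
    conn = accept(server)
    line = readline(conn)
    write(conn, uppercase(line) * "\n")
    close(conn)
end

# Client side: reads like blocking code, but the server task runs while it waits.
client = connect(ip"127.0.0.1", port)
write(client, "hello\n")
reply = readline(client)

close(client)
wait(server_task)
close(server)
```

No callbacks appear anywhere: each blocking call simply yields to the scheduler until its data arrives.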

# Interacting with databases

Open Database Connectivity (ODBC) is a low-level protocol for establishing connections
with the majority of databases and data sources (for more details, refer to
http://en.wikipedia.org/wiki/Open_Database_Connectivity).

Julia has an ODBC package that enables Julia scripts to talk to ODBC data sources. Install the
package through Pkg.add("ODBC"), and load it at the start of your code with using ODBC.

The package can work with a system Data Source Name (DSN) that contains all the
concrete connection information, such as server name, database, credentials, and so on.
Every operating system has its own utility to make DSNs. In Windows, the ODBC
administrator can be reached by navigating to Control Panel | Administrative Tools |
ODBC Data Sources; on other systems, you have IODBC or Unix ODBC.

# Parallel operations and computing

In our multicore CPU and clustered computing world, it is imperative for a new language
to have excellent parallel computing capabilities. This is one of the main strengths of Julia:
providing an environment based on message-passing between multiple processes that can
execute on the same machine or on remote machines.

In that sense, it implements the actor model (as Erlang, Elixir, and Pony do), but we'll see
that the actual coding happens on a higher level than receiving and sending messages
between processes, or workers (processors) as Julia calls them. The developer only needs to
explicitly manage the main process from which all other workers are started. The message
send and receive operations are simulated by higher-level operations that look like function
calls.

# Creating processes

Julia can be started as a REPL or as a separate application with a number of workers, n,
available.

The following command starts n processes on the local machine (this command
includes the Distributed package automatically):

In [1]:
# julia -p n

These workers are different processes, not threads, so they do not share memory.

To get the most out of a machine, set n equal to the number of processor cores. For example,
when n is 8, you have, in fact, nine processes: one for the REPL shell itself, and eight
workers that are ready to do parallel tasks.

In [2]:
using Distributed

Every worker has its own integer identifier, which we can see by calling the workers
function, workers(). When no extra workers have been started, this returns only the main
process, which has ID 1:

In [3]:
workers()

1-element Vector{Int64}:
 1

Each worker can get its own process ID with the myid() function. If you need more
workers, adding new ones is easy:

In [4]:
myid()

1

In [5]:
addprocs(5)

5-element Vector{Int64}:
 2
 3
 4
 5
 6

In [6]:
workers()

5-element Vector{Int64}:
 2
 3
 4
 5
 6

Here, the new workers run on the local machine, but the addprocs method also accepts
arguments to start processes on remote machines via SSH. This is the secure shell protocol
that enables you to execute commands on a remote computer via a shell in a totally
encrypted manner.

The number of available processes (the main process plus its workers) is given by nprocs();
in our case, this is 6. A worker can be removed by calling rmprocs() with its identifier;
for example, rmprocs(6) stops the worker with the ID 6.

In [8]:
nprocs()

6

In [9]:
rmprocs(6)

Task (done) @0x00007feb6c95eca0

In [10]:
workers()

4-element Vector{Int64}:
 2
 3
 4
 5
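
The bookkeeping functions above can be combined into a short script. This is a sketch that assumes the machine can start two extra local processes; the variable names are illustrative.

```julia
using Distributed

ids = addprocs(2)          # start two local workers and get their IDs

n_total = nprocs()         # main process plus all workers
n_workers = nworkers()     # workers only

rmprocs(ids)               # by default, waits until the workers have exited
still_there = any(id -> id in workers(), ids)
```

Removing the workers at the end keeps repeated runs of the script from accumulating idle processes.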

All these workers communicate via TCP ports and run on the same machine, which is why
it is called a local cluster. To activate workers on a cluster of computers, start Julia as
follows:

In [11]:
# julia --machine-file machines driver.jl

Workers can be dynamically added to or removed from a master Julia process, locally on
symmetric multiprocessor systems, remotely on a computer cluster, as well as in the cloud.
If more versatility is needed, you can work with the ClusterManager type.

# Using low-level communications

Julia's native parallel computing model is based on two primitives: remote calls and remote
references. At this level, we can give a certain worker a function with arguments to execute
with remotecall, and get the result back with fetch.

As a trivial example in the following
code, we call upon worker 2 to execute a square function on the number 1000:

In [17]:
r1 = remotecall(x -> x^2, 2, 1000)

Future(2, 1, 12, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)

The arguments are: the function, the worker ID, and the function's arguments. Such a
remote call returns immediately, thus not blocking the main process (the REPL in this case).
The main process continues executing while the remote worker does the assigned job. The
remotecall function returns a value, r1, of the Future type, which is a reference to the
computed result, which we can get using fetch:

In [19]:
fetch(r1)

1000000

The call to fetch will block the main process until worker 2 has finished the calculation.
The main process can also run wait(r1), which also blocks until the result of the remote
call becomes available. If you need the remote result immediately in the local operation, use
the following command:

In [20]:
remotecall_fetch(sin,3,3.28)

-0.13796586727122684

This is more efficient than fetch(remotecall(..)).
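
To see the two forms side by side, here is a small sketch under the assumption that we start one fresh local worker for the purpose; the variable names are ours and purely illustrative.

```julia
using Distributed

w = first(addprocs(1))   # ID of a freshly started local worker

# Two-step form: remotecall returns a Future at once, fetch blocks for the value.
fut = remotecall(x -> x^2, w, 1000)
two_step = fetch(fut)

# One-step form: the call blocks and returns the value directly.
one_step = remotecall_fetch(x -> x^2, w, 1000)

rmprocs(w)
```

Both forms yield the same value; the difference is only in whether we get a Future to hold on to in between.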

You can also use the @spawnat macro, which evaluates the expression in the second
argument on the worker specified by the first argument:

In [21]:
x = @spawnat 5 sin(2)

Future(5, 1, 15, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)

In [22]:
x

Future(5, 1, 15, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)

In [23]:
fetch(x)

0.9092974268256817

This is made even easier with @spawn, which only needs an expression to evaluate, because
it decides for itself where it will be executed (since Julia 1.3, the Distributed documentation
recommends the equivalent @spawnat :any instead):

In [24]:
x = @spawn sin(2)

Future(2, 1, 17, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)

In [25]:
fetch(x)

0.9092974268256817

To spread many evaluations of a function over the available workers, we can use a comprehension:

In [26]:
r = [@spawn sin(i) for i in 1:100]

100-element Vector{Future}:
 Future(3, 1, 19, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(4, 1, 20, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(5, 1, 21, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(2, 1, 22, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(3, 1, 23, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(4, 1, 24, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.Invasi

In [27]:
x = [fetch(a) for a in r]

100-element Vector{Float64}:
  0.8414709848078965
  0.9092974268256817
  0.1411200080598672
 -0.7568024953079282
 -0.9589242746631385
 -0.27941549819892586
  0.6569865987187891
  0.9893582466233818
  0.4121184852417566
 -0.5440211108893698
 -0.9999902065507035
 -0.5365729180004349
  0.4201670368266409
  ⋮
  0.8600694058124532
  0.8939966636005579
  0.10598751175115685
 -0.7794660696158047
 -0.9482821412699473
 -0.24525198546765434
  0.683261714736121
  0.9835877454343449
  0.3796077390275217
 -0.5733818719904229
 -0.9992068341863537
 -0.5063656411097588
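
The same launch-then-collect pattern can be written as a self-contained sketch. It assumes two fresh local workers and uses @spawnat :any, the spelling recommended for recent Julia versions; the input range 1:20 is an arbitrary choice.

```julia
using Distributed

addprocs(2)

# Launch one remote evaluation per input; each call returns a Future at once.
futures = [@spawnat :any sin(i) for i in 1:20]

# Collect the values in the order in which the futures were created.
values_par = [fetch(f) for f in futures]

# Sequential reference for comparison.
values_seq = [sin(i) for i in 1:20]
```

Because the futures are fetched in creation order, values_par lines up element by element with the sequential result, no matter which worker computed which entry.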

To execute the same statement on all the workers, we can also use the @everywhere macro:

In [36]:
@everywhere begin 
    println("WTF")
    println(myid())
end


WTF
1
      From worker 2:	WTF
      From worker 3:	WTF
      From worker 4:	WTF
      From worker 5:	WTF
      From worker 5:	5
      From worker 3:	3
      From worker 2:	2
      From worker 4:	4


In [37]:
begin 
    a = 2
end


2

All the workers correspond to different processes; they therefore do not share variables, for
example:

In [39]:
@everywhere print(a)

2

LoadError: On worker 2:
UndefVarError: a not defined
Stacktrace:
 [1] top-level scope
   @ none:1
 [2] eval
   @ ./boot.jl:373
 [3] #103
   @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:274
 [4] run_work_thunk
   @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
 [5] run_work_thunk
   @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:72
 [6] #96
   @ ./task.jl:423

...and 3 more exceptions.


The a variable is only known in the main process; all the other workers fail with the
UndefVarError: a not defined error message.

@everywhere can also be used to make data, such as the w variable, available to all
processes, for example, @everywhere w = 8.

In [45]:
@everywhere begin 
    w = 8
    println(w)
end


8
      From worker 3:	8
      From worker 2:	8
      From worker 4:	8
      From worker 5:	8


The following example makes a defs.jl source file available to all the workers:


In [46]:
# @everywhere include("defs.jl")

In [51]:
@everywhere function fib(n)
    if n < 2
        return n
    else
        return fib(n-1) + fib(n-2)
    end
end

In [52]:
@everywhere println(fib(myid()))

1
      From worker 2:	1
      From worker 3:	2
      From worker 4:	3
      From worker 5:	5


In order to be able to perform its task, a remote worker needs access to the function it
executes. You can make sure that all workers know about the functions they need by
loading the functions.jl source code with @everywhere include("functions.jl"), which
makes it available to all workers.

A best practice is to separate your code into two files: one file
(functions.jl) that contains the functions and parameters that need to
be run in parallel, and the other file (driver.jl) that manages the
processing and collects the results. Use the
@everywhere include("functions.jl") command in driver.jl to load the
functions and parameters on all processes; a plain include only loads
them on the process that calls it.
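
The two-file layout can be tried out in a single script. This is only a sketch under stated assumptions: the definitions file is written to a temporary directory so that the example is self-contained, and the cube function and the variable names are hypothetical.

```julia
using Distributed

addprocs(2)

# Stand-in for functions.jl: a tiny definitions file in a temporary directory.
functions_path = joinpath(mktempdir(), "functions.jl")
write(functions_path, "cube(x) = x^3\n")

# Stand-in for driver.jl: load the definitions on every process, then use them.
# $functions_path splices the master's value of the path into the expression.
@everywhere include($functions_path)
cubes = pmap(cube, 1:5)
```

Because the workers run on the same machine, they can read the same temporary file; on a real cluster the file would have to be reachable from every node.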

An alternative is to load the files from the command line. If you need
the file1.jl and file2.jl source files on all the n processes at startup time, use
the julia -p n -L file1.jl -L file2.jl driver.jl syntax, where driver.jl is
the script that organizes the computations.

In [55]:
# julia -p n -L file1.jl -L file2.jl driver.jl

Data movement between workers (such as when calling fetch) needs to be reduced as
much as possible in order to get good performance and scalability.

In [56]:
@everywhere a = "WTF is going on here"

In [57]:
@everywhere println(a)

WTF is going on here
      From worker 5:	WTF is going on here
      From worker 2:	WTF is going on here
      From worker 3:	WTF is going on here
      From worker 4:	WTF is going on here


In [58]:
@everywhere fb(n) = n < 2 ? 1 : fb(n-1) + fb(n-2)

In [60]:
@everywhere println(fb(myid()*3))

3
      From worker 2:	13
      From worker 3:	55
      From worker 4:	233
      From worker 5:	987


If every worker needs to know the d variable, its value can be broadcast to all processes,
for example by assigning it on every process with @everywhere.
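
One possible way to broadcast a value is sketched here; it is not necessarily the approach the original text had in mind. It relies on the $ interpolation that @everywhere supports, and the contents of d are made up for the illustration.

```julia
using Distributed

addprocs(2)

d = [1.0, 2.0, 3.0]    # hypothetical data that lives on the master
@everywhere d = $d     # $d splices the master's value into the expression

# Ask each worker to evaluate d in its own Main module and send it back.
copies = [remotecall_fetch(Core.eval, pid, Main, :d) for pid in workers()]
```

Every entry of copies is a separate array that happens to hold the same numbers, so later changes on one process are not seen by the others.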

Each worker then has its local copy of data. Scheduling the workers is done with tasks
(refer to the Tasks section of Chapter 4, Control Flow), so that no locking is required; for
example, when a communication operation such as fetch or wait is executed, the current
task is suspended, and the scheduler picks another task to run. When the wait event
completes (for example, the data shows up), the current task is restarted.
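
The task-based scheduling described above can be made visible with @sync and @async. In this sketch, which assumes two fresh local workers, each remote request runs in its own task on the master; the ids vector is a name of our own choosing.

```julia
using Distributed

addprocs(2)

ids = Vector{Int}(undef, nworkers())

# Each @async block is a task. While one task is suspended inside
# remotecall_fetch, the scheduler can run the others; @sync waits for all.
@sync for (i, pid) in enumerate(workers())
    @async ids[i] = remotecall_fetch(myid, pid)
end
```

All requests are in flight at the same time, yet no lock is needed, because each task only writes to its own slot of ids.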

In many cases, however, you do not have to specify or create processes to do parallel
programming in Julia, as we will see in the next section.

# Parallel loops and maps

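A for loop with a large number of iterations is a good candidate for parallel execution, and
Julia has a special construct to do this: the @distributed macro (named @parallel in Julia
versions before 1.0), which is applied to for loops. As a running example, the following
sequential function approximates pi with a Monte Carlo simulation of Buffon's needle problem: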

In [61]:
function buffon(n)
    hit = 0
    for i = 1:n
        mp = rand()
        phi = (rand() * pi) - pi / 2 # angle at which needle falls
        xright = mp + cos(phi)/2 # x location of needle
        xleft = mp - cos(phi)/2
        # does needle cross either x == 0 or x == 1?
        p = (xright >= 1 || xleft <= 0) ? 1 : 0
        hit += p
    end
    miss = n - hit
    piapprox = n / hit * 2
end

buffon (generic function with 1 method)

In [63]:
@time buffon(10);

  0.000002 seconds


In [64]:
@time buffon(100);

  0.000008 seconds


In [65]:
@time buffon(10000);

  0.000391 seconds


In [66]:
@time buffon(1000000);

  0.040812 seconds


In [68]:
@time buffon(100000000)

  3.439089 seconds


3.1412899936673164
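
The last line of buffon, n / hit * 2, follows from a short calculation. The derivation below is our own sketch and assumes what the code simulates: a needle of length 1, lines at x = 0 and x = 1, a midpoint m drawn uniformly from [0, 1], and an angle drawn uniformly from [-pi/2, pi/2].

```latex
% m ~ U(0,1): needle midpoint, \varphi ~ U(-\pi/2, \pi/2): needle angle.
% The needle spans [m - \cos(\varphi)/2, m + \cos(\varphi)/2] along x.
\begin{align*}
P(\text{hit} \mid \varphi)
  &= P\!\left(m \le \tfrac{\cos\varphi}{2}\right)
   + P\!\left(m \ge 1 - \tfrac{\cos\varphi}{2}\right)
   = \cos\varphi, \\
P(\text{hit})
  &= \frac{1}{\pi}\int_{-\pi/2}^{\pi/2} \cos\varphi \, d\varphi
   = \frac{2}{\pi}, \\
\frac{\text{hit}}{n} \approx \frac{2}{\pi}
  &\quad\Longrightarrow\quad
  \pi \approx \frac{2n}{\text{hit}} .
\end{align*}
```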

As the timings show, the running time grows roughly linearly with n. What if we could spread
the calculations over the available processes? For this, we have to rearrange our code a bit.
In the sequential version, the variable hit is increased on every iteration inside the for loop
by the amount p (which is 0 or 1). In the parallel version, we rewrite the code so that p is the
value of the loop body, computed on one of the participating processes, and these values are
then summed.

The @distributed macro acts on a for loop, splitting the range and distributing it to each
process. It optionally takes a "reducer" as its first argument. If a reducer is specified, the
results from each remote process are aggregated using the reducer. In the following example,
we use the (+) function as a reducer, which means that the values of the loop body on each
worker are summed to calculate the final value of hit:

In [69]:
function buffon_par(n)
    hit = @distributed (+) for i = 1:n
        mp = rand()
        phi = (rand() * pi) - pi / 2
        xright = mp + cos(phi)/2
        xleft = mp - cos(phi)/2
        (xright >= 1 || xleft <= 0) ? 1 : 0
    end
    miss = n - hit
    piapprox = n / hit * 2
end

buffon_par (generic function with 1 method)

In [76]:
@time buffon(1000000000);

 35.076468 seconds


In [77]:
@time buffon_par(1000000000);

  9.484346 seconds (295 allocations: 12.906 KiB)


By changing a normal for loop into a parallel-reducing version, we were able to get a
substantial improvement in the calculation time (about 3.7 times faster here, 9.5 s versus
35.1 s, with four workers), at the cost of higher memory consumption. In general, always test
whether the parallel version really is an improvement over the sequential version in your
specific case!

The first argument of @distributed is the reducing operator (here, (+)); the second is the
for loop, which must start on the same line.

The calculations in the loop must be
independent of one another, because the order in which they run is arbitrary, given that
they are scheduled over the different workers. The actual reduction (summing up in this
case) is done on the calling process.
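
A reduction that is easy to verify by hand makes the mechanics clearer than the random needle simulation. The following sketch assumes two fresh local workers and sums the squares of 1 to n; the value of n is arbitrary.

```julia
using Distributed

addprocs(2)

n = 1_000

# Each worker sums the squares in its share of 1:n; the partial sums
# are then combined with + on the calling process.
total_par = @distributed (+) for i = 1:n
    i^2
end

# Sequential reference value.
total_seq = sum(abs2, 1:n)
```

Since addition of integers does not depend on the order of the terms, the parallel and sequential totals agree exactly.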

Any variables used inside the parallel loop are copied (broadcast) to each process.
Because of this, code such as the following fails to initialize the arr array on the calling
process, because each worker writes to its own copy of it:

In [78]:
arr = zeros(1000)
@distributed for i = 1:1000
    arr[i] = 10
end

Without a reducer, @distributed returns a Task immediately instead of waiting for the loop.
With workers available, arr on the calling process still contains only zeros once that task
has finished.
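
The copying behaviour can be observed directly. The sketch below assumes two fresh local workers and a smaller array of 100 elements; the second half shows one possible alternative, letting the workers return the values through pmap instead of writing into a captured array.

```julia
using Distributed

addprocs(2)

arr = zeros(100)

# Each worker fills its own copy of arr; @sync waits for the loop to finish.
@sync @distributed for i = 1:100
    arr[i] = 10
end
untouched = all(iszero, arr)    # true if the master's arr was never written to

# One way to get the values back: have the workers return them.
filled = pmap(i -> 10.0, 1:100)
```

Note that this only holds when workers exist; with a single process the loop runs locally and does modify arr.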


If the computational task is to apply a function to all elements in some collection, you can
use a parallel map operation through the pmap function.

The call pmap(f, coll) applies the function f to each element of the collection coll in
parallel, but preserves the order of the collection in the result. Suppose we have to
calculate the rank of a number of large matrices. We can do this sequentially, as follows:

In [79]:
using LinearAlgebra
function rank_marray()
    marr = [rand(1000,1000) for i=1:10]
    for arr in marr
        println(LinearAlgebra.rank(arr))
    end
end

rank_marray (generic function with 1 method)

In [88]:
@time rank_marray();

1000
1000
1000
1000
1000
1000
1000
1000
1000
1000
  2.440538 seconds (355 allocations: 158.397 MiB, 1.69% gc time)


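In this run, however, parallelizing brings no benefit: the pmap version below takes 2.57 s
against 2.44 s for the sequential one, most likely because every 1000 x 1000 matrix has to be
sent to a worker before its rank can be computed: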

In [81]:
function prank_marray()
    marr = [rand(1000,1000) for i=1:10]
    println(pmap(LinearAlgebra.rank, marr))
end

prank_marray (generic function with 1 method)

In [87]:
@time prank_marray();

[1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
  2.569863 seconds (925 allocations: 76.324 MiB, 0.04% gc time)
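
That pmap preserves the order of the collection, even when the elements finish in a different order, can be shown with a small sketch. It assumes two fresh local workers; the sleep durations are arbitrary and only serve to make the later inputs finish sooner.

```julia
using Distributed

addprocs(2)

# Larger inputs sleep less, so they tend to complete before smaller ones.
squares = pmap(x -> (sleep(0.05 * (5 - x)); x^2), 1:4)
```

The result is ordered by input position, not by completion time, so squares lines up with 1:4.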


The @distributed macro and pmap are both powerful tools to tackle map-reduce
problems.

Julia's model for building a large parallel application works by means of a global
distributed address space. This means that you can hold a reference to an object that lives
on another machine participating in a computation. These references are easily
manipulated and passed around between machines, making it simple to keep track of
what's being computed where. Also, machines can be added in mid-computation when
needed.