/
overview-slides.html
193 lines (193 loc) · 11.7 KB
/
overview-slides.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<title></title>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link rel="stylesheet" type="text/css" media="screen, projection, print"
href="http://www.w3.org/Talks/Tools/Slidy2/styles/slidy.css" />
<script src="http://www.w3.org/Talks/Tools/Slidy2/scripts/slidy.js"
charset="utf-8" type="text/javascript"></script>
</head>
<body>
<div id="flexible-parallel-processing-using-the-r-future-and-python-dask-packages" class="slide section level1">
<h1>Flexible parallel processing using the R future and Python Dask packages</h1>
<p>Chris Paciorek, Department of Statistics, UC Berkeley</p>
</div>
<div id="this-tutorial" class="slide section level1">
<h1>1) This Tutorial</h1>
<p>This tutorial covers the use of R's future and Python's Dask packages, well-established tools for parallelizing computions on a single machine or across multiple machines. There is also a bit of material on Python's Ray package, which was developed more recently.</p>
<p>You should be able to replicate much of what is covered here provided you have Rand Python on your computer, but some of the parallelization approaches may not work on Windows.</p>
<p>This tutorial assumes you have a working knowledge of either R or Python, but not necessarily knowledge of parallelization in R or Python.</p>
<p>Materials for this tutorial, including the Markdown files and associated code files that were used to create this document are available on GitHub at [https://github.com/berkeley-scf/tutorial-dask-future]. You can download the files by doing a git clone from a terminal window on a UNIX-like machine, as follows:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">git clone https<span class="op">:</span><span class="er">//</span>github.com<span class="op">/</span>berkeley<span class="op">-</span>scf<span class="op">/</span>tutorial<span class="op">-</span>dask<span class="op">-</span>future</code></pre></div>
<p>See the <code>Makefile</code> for how to generate the html files in this tutorial.</p>
<p>This tutorial by Christopher Paciorek is licensed under a Creative Commons Attribution 3.0 Unported License.</p>
</div>
<div id="some-useful-terminology" class="slide section level1">
<h1>2) Some useful terminology:</h1>
<ul>
<li><em>cores</em>: We'll use this term to mean the different processing units available on a single node.</li>
<li><em>nodes</em>: We'll use this term to mean the different computers, each with their own distinct memory, that make up a cluster or supercomputer.</li>
<li><em>processes</em>: instances of a program executing on a machine; multiple processes may be executing at once. A given executable (e.g., Python or R) may start up multiple processes at once. Ideally we have no more user processes than cores on a node.</li>
<li><em>threads</em>: multiple paths of execution within a single process; the OS sees the threads as a single process, but one can think of them as 'lightweight' processes. Ideally when considering the processes and their threads, we would have the number of total threads across all processes not exceed the number of cores on a node.</li>
<li><em>forking</em>: child processes are spawned that are identical to the parent, but with different process IDs and their own memory.</li>
<li><em>sockets</em>: some of R's parallel functionality involves creating new R processes (e.g., starting processes via <em>Rscript</em>) and communicating with them via a communication technology called sockets.</li>
<li><em>tasks</em>: This term gets used in various ways (including in place of 'processes'), but we'll use it to refer to the individual computational items you want to complete - e.g., one task per cross-validation fold or one task per simulation replicate/iteration.</li>
</ul>
</div>
<div id="types-of-parallel-processing" class="slide section level1">
<h1>3) Types of parallel processing</h1>
<p>There are two basic flavors of parallel processing (leaving aside GPUs): distributed memory and shared memory. With shared memory, multiple processors (which I'll call cores) share the same memory. With distributed memory, you have multiple nodes, each with their own memory. You can think of each node as a separate computer connected by a fast network.</p>
<h2 id="shared-memory">3.1) Shared memory</h2>
<p>For shared memory parallelism, each core is accessing the same memory so there is no need to pass information (in the form of messages) between different machines. But in some programming contexts one needs to be careful that activity on different cores doesn't mistakenly overwrite places in memory that are used by other cores.</p>
<p>However, except for certain special situations, the different worker processes on a given machine do not share objects in memory. So most often, one has multiple copies of the same objects, one per worker process.</p>
<h3 id="threading">Threading</h3>
<p>Threads are multiple paths of execution within a single process. If you are monitoring CPU usage (such as with <em>top</em> in Linux or Mac) and watching a job that is executing threaded code, you'll see the process using more than 100% of CPU. When this occurs, the process is using multiple cores, although it appears as a single process rather than as multiple processes.</p>
<p>Note that this is a different notion than a processor that is hyperthreaded. With hyperthreading a single core appears as two cores to the operating system.</p>
<p>Threads generally do share objects in memory, thereby allowing us to have a single copy of objects instead of one per thread.</p>
<h2 id="distributed-memory">3.2) Distributed memory</h2>
<p>Parallel programming for distributed memory parallelism requires passing messages between the different nodes.</p>
<p>The standard protocol for passing messages is MPI, of which there are various versions, including <em>openMPI</em>.</p>
<p>Tools such as Dask, Ray and future all manage the work of moving information between nodes for you (and don't generally use MPI).</p>
<h2 id="other-type-of-parallel-processing">3.3) Other type of parallel processing</h2>
<p>We won't cover either of these in this material.</p>
<h3 id="gpus">GPUs</h3>
<p>GPUs (Graphics Processing Units) are processing units originally designed for rendering graphics on a computer quickly. This is done by having a large number of simple processing units for massively parallel calculation. The idea of general purpose GPU (GPGPU) computing is to exploit this capability for general computation.</p>
<p>Most researchers don't program for a GPU directly but rather use software (often machine learning software such as Tensorflow or PyTorch) that has been programmed to take advantage of a GPU if one is available.</p>
<h3 id="spark-and-hadoop">Spark and Hadoop</h3>
<p>Spark and Hadoop are systems for implementing computations in a distributed memory environment, using the MapReduce approach.</p>
<p>Note that Dask provides a lot of the same functionality as Spark, allowing one to create distributed datasets where pieces of the dataset live on different machines but can be treated as a single dataset from the perspective of the user.</p>
</div>
<div id="package-comparison" class="slide section level1">
<h1>4) Package comparison</h1>
<p>See this outline for an overview comparison of future ( R ), Dask (Python) and Ray (Python)</p>
<ul>
<li>Use cases
<ul>
<li>future
<ul>
<li>parallelizing tasks</li>
<li>one or more machines (nodes)</li>
</ul></li>
<li>Dask
<ul>
<li>parallelizing tasks</li>
<li>distributed datasets</li>
<li>one or more machines (nodes)</li>
</ul></li>
<li>Ray
<ul>
<li>parallelizing tasks</li>
<li>building distributed applications (interacting tasks)</li>
<li>one or more machines (nodes)</li>
</ul></li>
</ul></li>
<li>Task allocation
<ul>
<li>future
<ul>
<li>static by default</li>
<li>dynamic optional</li>
</ul></li>
<li>Dask
<ul>
<li>dynamic</li>
</ul></li>
<li>Ray
<ul>
<li>dynamic (I think)</li>
</ul></li>
</ul></li>
<li>Shared memory
<ul>
<li>future
<ul>
<li>shared across workers only if using forked processes (<code>multicore</code> plan on Linux/MacOS)
<ul>
<li>if data modified, copies need to be made</li>
</ul></li>
<li>data shared across static tasks assigned to a worker</li>
</ul></li>
<li>Dask
<ul>
<li>shared across workers only if using <code>threads</code> scheduler</li>
<li>data shared across tasks on a worker if data are delayed</li>
</ul></li>
<li>Ray
<ul>
<li>shared across all workers on a node via the object store (nice!)</li>
</ul></li>
</ul></li>
<li>Copies made
<ul>
<li>future
<ul>
<li>generally copy one per worker, but one per task with dynamic allocation</li>
</ul></li>
<li>Dask
<ul>
<li>generally one copy per worker if done correctly (data should be delayed)</li>
</ul></li>
<li>Ray
<ul>
<li>one copy per node if done correctly (use <code>ray.put</code> to use object store)</li>
</ul></li>
</ul></li>
</ul>
</div>
<div id="pythons-dask-package" class="slide section level1">
<h1>5) Python's Dask package</h1>
<p>Please see the <a href="python-dask.html">Python dask materials</a>.</p>
</div>
<div id="r-future-package" class="slide section level1">
<h1>6) R future package</h1>
<p>Please see <a href="R-future.html">R future materials</a>.</p>
</div>
<div id="pythons-ray-package" class="slide section level1">
<h1>7) Python's Ray package</h1>
<p>For a very brief introduction to Ray, please see the <a href="python-ray.html">Python Ray materials</a>.</p>
</div>
</body>
</html>