\lab{Profiling}{Profiling}
\objective{Efficiency is essential to algorithmic programming.
Profiling is the process of measuring the complexity and efficiency of a program, allowing the programmer to see what parts of the code need to be optimized.
In this lab we present common techniques for speeding up Python code, including the built-in profiler and the Numba module.
}
\section*{Magic Commands in IPython} % ----------------------------------------
IPython has tools for quickly timing and profiling code.
These ``magic commands'' start with one or two \li{\%} characters---one for testing a single line of code, and two for testing a block of code.
\begin{itemize}
\item \li{<p<\%time>p>}: Execute some code and print out its execution time.
\item \li{<p<\%timeit>p>}: Execute some code several times and print out the average execution time.
\item \li{<p<\%prun>p>}: Run a statement through the Python code profiler,\footnote{{\color{purple}{\texttt{\%prun}}} is a shortcut for \texttt{cProfile.run()}; see \url{https://docs.python.org/3/library/profile.html} for details.} printing the number of function calls and the time each takes. We will demonstrate this tool a little later.
\end{itemize}
\begin{lstlisting}
# Time the construction of a list using list comprehension.
<g<In [1]:>g> <p<%time>p> x = [i**2 for i in range(int(1e5))]
<<CPU times: user 36.3 ms, sys: 3.28 ms, total: 39.6 ms
Wall time: 40.9 ms>>
# Time the same list construction, but with a regular for loop.
<g<In [2]:>g> <p<%%time>p> # Use a double %% to time a block of code.
<g<...:>g> x = []
<g<...:>g> for i in range(int(1e5)):
<g<...:>g> x.append(i**2)
<g<...:>g>
<<CPU times: user 50 ms, sys: 2.79 ms, total: 52.8 ms
Wall time: 55.2 ms>> # The list comprehension is faster!
\end{lstlisting}
% Use \li{<p<\%time>p>} and \li{<p<\%timeit>p>} to select fast code snippets, functions, and algorithms (for example, using a list comprehension where possible instead of a regular loop).
% For the complete list of magic IPython commands, see \url{http://ipython.readthedocs.io/en/stable/interactive/magics.html}.
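The same measurements are available outside of IPython through the standard-library \li{timeit} module, which \li{<p<\%timeit>p>} is built on. The following sketch times the list comprehension from above; note that \li{timeit.timeit()} returns the \emph{total} time over all runs, not the average.

```python
import timeit

# Run the statement 100 times and report the total elapsed time in seconds.
total = timeit.timeit("x = [i**2 for i in range(1000)]", number=100)

# %timeit reports an average; with the timeit module, divide by the run count.
average = total / 100
```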
\subsection*{Choosing Faster Algorithms} % ------------------------------------
The best way to speed up a program is to use an efficient algorithm.
A bad algorithm, even when implemented well, is never an adequate substitute for a good algorithm.
\begin{problem} % Triangle path sums from Project Euler.
This problem comes from \url{https://projecteuler.net} (problems 18 and 67).
By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.
\begin{center}
\textbf{\color{red}{3}}\\
\textbf{\color{red}{7}} 4\\
2 \textbf{\color{red}{4}} 6\\
8 5 \textbf{\color{red}{9}} 3
\end{center}
That is, $3 + 7 + 4 + 9 = 23$.
The following function finds the maximum path sum of the triangle in \texttt{triangle.txt} by recursively computing the sum of every possible path---the ``brute force'' approach.
\begin{lstlisting}
def max_path(filename="triangle.txt"):
"""Find the maximum vertical path in a triangle of values."""
with open(filename, 'r') as infile:
data = [[int(n) for n in line.split()]
for line in infile.readlines()]
def path_sum(r, c, total):
"""Recursively compute the max sum of the path starting in row r
and column c, given the current total.
"""
total += data[r][c]
if r == len(data) - 1: # Base case.
return total
else: # Recursive case.
return max(path_sum(r+1, c, total), # Next row, same column.
path_sum(r+1, c+1, total)) # Next row, next column.
return path_sum(0, 0, 0) # Start the recursion from the top.
\end{lstlisting}
The data in \texttt{triangle.txt} contains 15 rows and hence 16384 paths, so it is possible to solve this problem by trying every route.
However, for a triangle with 100 rows, there are $2^{99}$ paths to check, which would take billions of years to compute even for a program that could check one trillion routes per second.
No amount of improvement to \li{max_path()} can make it run in an acceptable amount of time on such a triangle---we need a different algorithm.
Write a function that accepts a filename containing a triangle of integers.
Compute the largest path sum with the following strategy: starting from the next to last row of the triangle, replace each entry with the sum of the current entry and the greater of the two ``child entries.''
Continue this replacement up through the entire triangle.
The top entry in the triangle will be the maximum path sum.
In other words, work from the bottom instead of from the top.
\begin{center}
\begin{tabular}{ccccccc}
\begin{tabular}{c}
3\\
7 4\\
2 4 6\\
\color{red}{8 5 9 3}
\end{tabular}
&$\longrightarrow$&
\begin{tabular}{c}
3\\
7 4\\
\color{red}{10 13 15}\\
\color{black}{8 5 9 3}
\end{tabular}
&$\longrightarrow$&
\begin{tabular}{c}
3\\
\color{red}{20 19}\\
\color{black}{10 13 15}\\
\color{black}{8 5 9 3}
\end{tabular}
&$\longrightarrow$&
\begin{tabular}{c}
\color{red}{\textbf{23}}\\
\color{black}{20 19}\\
\color{black}{10 13 15}\\
\color{black}{8 5 9 3}
\end{tabular}
\end{tabular}
\end{center}
Use your function to find the maximum path sum of the 100-row triangle stored in \texttt{triangle\_large.txt}.
Make sure that your new function still gets the correct answer for the smaller \texttt{triangle.txt}.
Finally, use \li{<p<\%time>p>} or \li{<p<\%timeit>p>} to time both functions on \texttt{triangle.txt}.
Your new function should be about 100 times faster than the original.
\end{problem}
\subsection*{The Profiler} % --------------------------------------------------
The profiling command \li{<p<\%prun>p>} lists the functions that are called during the execution of a piece of code, along with the following information.
\begin{table}[H]
\centering
\begin{tabular}{c|l}
Heading & Description \\ \hline
\li{primitive calls} & The number of calls that were not caused by recursion.\\
\li{ncalls} & The number of calls to the function. If recursion occurs, the output\\ & is \texttt{<total number of calls>/<number of primitive calls>}.\\
\li{tottime} & The amount of time spent in the function, not including calls to other functions.\\
\li{percall} & The amount of time spent in each call of the function.\\
\li{cumtime} & The amount of time spent in the function, including calls to other functions.\\
\end{tabular}
\end{table}
\begin{lstlisting}
# Profile the original function from Problem 1.
<g<In[3]:>g> <p<%prun>p> max_path("triangle.txt")
\end{lstlisting}
{\small
\begin{verbatim}
81947 function calls (49181 primitive calls) in 0.036 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
32767/1 0.025 0.000 0.034 0.034 profiling.py:18(path_sum)
16383 0.005 0.000 0.005 0.000 {built-in method builtins.max}
32767 0.003 0.000 0.003 0.000 {built-in method builtins.len}
1 0.002 0.002 0.002 0.002 {method `readlines' of `_io._IOBase' objects}
1 0.000 0.000 0.000 0.000 {built-in method io.open}
1 0.000 0.000 0.036 0.036 profiling.py:12(max_path)
1 0.000 0.000 0.000 0.000 profiling.py:15(<listcomp>)
1 0.000 0.000 0.036 0.036 {built-in method builtins.exec}
2 0.000 0.000 0.000 0.000 codecs.py:318(decode)
1 0.000 0.000 0.036 0.036 <string>:1(<module>)
15 0.000 0.000 0.000 0.000 {method `split' of `str' objects}
1 0.000 0.000 0.000 0.000 _bootlocale.py:23(getpreferredencoding)
2 0.000 0.000 0.000 0.000 {built-in method _codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method _locale.nl_langinfo}
1 0.000 0.000 0.000 0.000 codecs.py:259(__init__)
1 0.000 0.000 0.000 0.000 codecs.py:308(__init__)
1 0.000 0.000 0.000 0.000 {method `disable' of `_lsprof.Profiler' objects}
\end{verbatim}
}
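As the footnote above indicates, \li{<p<\%prun>p>} is a thin wrapper around the standard-library \li{cProfile} module, so the same report can be generated from an ordinary script. A minimal sketch (the function being profiled is just a placeholder workload):

```python
import cProfile
import io
import pstats

def work(n):
    """Placeholder workload: sum the first n squares."""
    return sum(i**2 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()        # Start collecting profiling data.
work(100000)
profiler.disable()       # Stop collecting.

# Sort by internal time, matching %prun's default ordering, and show 5 rows.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
report = stream.getvalue()
```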
\section*{Optimizing Python Code} % ===========================================
A poor implementation of a good algorithm is better than a good implementation of a bad algorithm, but clumsy implementation can still cripple a program's efficiency.
The following are a few important practices for speeding up a Python program.
Remember, however, that such improvements are futile if the algorithm is poorly suited for the problem.
\subsection*{Avoid Repetition} % ----------------------------------------------
% {\small
% \begin{verbatim}
% ncalls tottime percall cumtime percall filename:lineno(function)
% 32767/1 0.025 0.000 0.034 0.034 profiling.py:18(path_sum)
% 16383 0.005 0.000 0.005 0.000 {built-in method builtins.max}
% 32767 0.003 0.000 0.003 0.000 {built-in method builtins.len}
% 1 0.002 0.002 0.002 0.002 {method `readlines' of `_io._IOBase' objects}
% 15 0.000 0.000 0.000 0.000 {method `split' of `str' objects}
% \end{verbatim}
% }
A clean program does no more work than is necessary.
The \li{ncalls} column of the profiler output is especially useful for identifying parts of a program that might be repetitive.
For example, the profile of \li{max_path()} indicates that \li{len()} was called 32,767 times---exactly as many times as \li{path_sum()}.
This is an easy fix: save \li{len(data)} as a variable somewhere outside of \li{path_sum()}.
\begin{lstlisting}
<g<In [4]:>g> def max_path_clean(filename="triangle.txt"):
<g<...:>g> with open(filename, 'r') as infile:
<g<...:>g> data = [[int(n) for n in line.split()]
<g<...:>g> for line in infile.readlines()]
<g<...:>g> N = len(data) # Calculate len(data) outside of path_sum().
<g<...:>g> def path_sum(r, c, total):
<g<...:>g> total += data[r][c]
<g<...:>g> if r == N - 1: # Use N instead of len(data).
<g<...:>g> return total
<g<...:>g> else:
<g<...:>g> return max(path_sum(r+1, c, total),
<g<...:>g> path_sum(r+1, c+1, total))
<g<...:>g> return path_sum(0, 0, 0)
<g<...:>g>
<g<In [5]:>g> <p<%prun>p> max_path_clean("triangle.txt")
\end{lstlisting}
{\small
\begin{verbatim}
49181 function calls (16415 primitive calls) in 0.026 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
32767/1 0.020 0.000 0.025 0.025 <ipython-input-5-9e8c48bb1aba>:6(path_sum)
16383 0.005 0.000 0.005 0.000 {built-in method builtins.max}
1 0.002 0.002 0.002 0.002 {method `readlines' of `_io._IOBase' objects}
1 0.000 0.000 0.000 0.000 {built-in method io.open}
1 0.000 0.000 0.026 0.026 <ipython-input-5-9e8c48bb1aba>:1(max_path_clean)
1 0.000 0.000 0.000 0.000 <ipython-input-5-9e8c48bb1aba>:3(<listcomp>)
1 0.000 0.000 0.027 0.027 {built-in method builtins.exec}
15 0.000 0.000 0.000 0.000 {method `split' of `str' objects}
1 0.000 0.000 0.027 0.027 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 codecs.py:318(decode)
1 0.000 0.000 0.000 0.000 _bootlocale.py:23(getpreferredencoding)
2 0.000 0.000 0.000 0.000 {built-in method _codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method _locale.nl_langinfo}
1 0.000 0.000 0.000 0.000 codecs.py:308(__init__)
1 0.000 0.000 0.000 0.000 codecs.py:259(__init__)
1 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method `disable' of `_lsprof.Profiler' objects}
\end{verbatim}
}
Note that the total number of primitive function calls decreased from 49,181 to 16,415.
Using \li{<p<\%timeit>p>} also shows that the run time decreased by about 15\%.
Moving code outside of a loop or an often-used function usually results in a similar speedup.
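Hoisting a loop-invariant computation works the same way outside of recursion. In the sketch below, the invariant value \li{math.sqrt(2)} is computed once instead of on every iteration; both versions produce identical results.

```python
import math

data = list(range(100000))

# Repetitive: math.sqrt(2) is recomputed on every iteration.
scaled_slow = [math.sqrt(2) * x for x in data]

# Clean: compute the invariant value once, outside the loop.
root2 = math.sqrt(2)
scaled_fast = [root2 * x for x in data]
```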
Another important way of reducing repetition is carefully controlling loop conditions to avoid unnecessary iterations.
Consider the problem of identifying Pythagorean triples, sets of three distinct integers $a < b < c$ such that $a^2 + b^2 = c^2$.
The following function identifies all such triples where each term is less than a parameter $N$ by checking all possible triples.
\begin{lstlisting}
>>> def pythagorean_triples_slow(N):
... """Compute all pythagorean triples with entries less than N."""
... triples = []
... for a in range(1, N): # Try values of a from 1 to N-1.
... for b in range(1, N): # Try values of b from 1 to N-1.
... for c in range(1, N): # Try values of c from 1 to N-1.
... if a**2 + b**2 == c**2 and a < b < c:
... triples.append((a,b,c))
... return triples
...
\end{lstlisting}
Since $a < b < c$ by definition, any computations where $b \le a$ or $c \le b$ are unnecessary.
Additionally, once $a$ and $b$ are chosen, $c$ can be no greater than $\sqrt{a^2 + b^2}$.
The following function changes the loop conditions to avoid these cases and takes care to only compute $a^2 + b^2$ once for each unique pairing $(a,b)$.
\begin{lstlisting}
>>> from math import sqrt
>>> def pythagorean_triples_fast(N):
... """Compute all pythagorean triples with entries less than N."""
... triples = []
... for a in range(1, N): # Try values of a from 1 to N-1.
... for b in range(a+1, N): # Try values of b from a+1 to N-1.
... _sum = a**2 + b**2
... for c in range(b+1, min(int(sqrt(_sum))+1, N)):
... if _sum == c**2:
... triples.append((a,b,c))
... return triples
...
\end{lstlisting}
These improvements have a drastic impact on run time, even though the main approach---checking by brute force---is the same.
\begin{lstlisting}
<g<In [6]:>g> <p<%time>p> triples = pythagorean_triples_slow(500)
<<CPU times: user 1min 51s, sys: 389 ms, total: 1min 51s
Wall time: 1min 52s>> # 112 seconds.
<g<In [7]:>g> <p<%time>p> triples = pythagorean_triples_fast(500)
<<CPU times: user 1.56 s, sys: 5.38 ms, total: 1.57 s
Wall time: 1.57 s>> # 98.6% faster!
\end{lstlisting}
\begin{problem}
The following function computes the first $N$ prime numbers.
\begin{lstlisting}
def primes(N):
"""Compute the first N primes."""
primes_list = []
current = 2
while len(primes_list) < N:
isprime = True
for i in range(2, current): # Check for nontrivial divisors.
if current % i == 0:
isprime = False
if isprime:
primes_list.append(current)
current += 1
return primes_list
\end{lstlisting}
This function takes about 6 minutes to find the first 10,000 primes on a fast computer.
Without significantly modifying the approach, rewrite \li{primes()} so that it can compute 10,000 primes in under 0.1 seconds.
Use the following facts to reduce unnecessary iterations.
\begin{itemize}
\item A number is not prime if it has one or more divisors other than 1 and itself.
\\(Hint: recall the \li{break} statement.)
\item If $p\nmid n$, then $ap\nmid n$ for any integer $a$.
Also, if $n$ has a divisor $p$ with $1 < p < n$, then it has one satisfying $1 < p \le \sqrt{n}$, so it suffices to check potential divisors up to $\sqrt{n}$.
\item Except for $2$, primes are always odd.
\end{itemize}
Your new function should be helpful for solving problem 7 on \url{https://projecteuler.net}.
\label{prob:profiling-primes-naive}
\end{problem}
\subsection*{Avoid Loops} % ---------------------------------------------------
% Most repetition occurs in a looping structure.
% \textbf{Avoid loops where possible, especially nested loops} (loops within loops).
% If nested loops are unavoidable, focus optimization efforts on the innermost loop, since that part of the code gets the most repetitions.
NumPy routines and built-in functions are often useful for eliminating loops altogether. %, a process called \emph{vectorization}.
Consider the simple problem of summing the rows of a matrix, implemented in three ways.
\begin{lstlisting}
>>> def row_sum_awful(A):
... """Sum the rows of A by iterating through rows and columns."""
... m,n = A.shape
... row_totals = np.empty(m) # Allocate space for the output.
... for i in range(m): # For each row...
... total = 0
... for j in range(n): # ...iterate through the columns.
... total += A[i,j]
... row_totals[i] = total # Record the total.
... return row_totals
...
>>> def row_sum_bad(A):
... """Sum the rows of A by iterating through rows."""
... return np.array([sum(A[i,:]) for i in range(A.shape[0])])
...
>>> def row_sum_fast(A):
... """Sum the rows of A with NumPy."""
... return np.<<sum>>(A, axis=1) # Or A.sum(axis=1).
...
\end{lstlisting}
The three functions compute the same result, but their run times differ dramatically.
\begin{lstlisting}
<g<In [8]:>g> import numpy as np
<g<In [9]:>g> A = np.random.random((10000, 10000))
<g<In [10]:>g> <p<%time>p> rows = row_sum_awful(A)
<<CPU times: user 22.7 s, sys: 137 ms, total: 22.8 s
Wall time: 23.2 s>> # SLOW!
<g<In [11]:>g> <p<%time>p> rows = row_sum_bad(A)
<<CPU times: user 8.85 s, sys: 15.6 ms, total: 8.87 s
Wall time: 8.89 s>> # Slow!
<g<In [12]:>g> <p<%time>p> rows = row_sum_fast(A)
<<CPU times: user 61.2 ms, sys: 1.3 ms, total: 62.5 ms
Wall time: 64 ms>> # Fast!
\end{lstlisting}
In this experiment, \li{row_sum_fast()} runs several hundred times faster than \li{row_sum_awful()}.
This is primarily because looping is expensive in Python, but NumPy handles loops in C, which is much quicker.
Other NumPy functions like \li{np.<<sum>>()} with an \li{axis} argument can often be used to eliminate loops in a similar way.
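Other reductions follow the same pattern: the \li{axis} argument collapses one dimension of the array in compiled code, replacing an explicit Python loop. A short sketch (the sample array here is only for illustration):

```python
import numpy as np

A = np.arange(12).reshape(3, 4)   # [[0 1 2 3], [4 5 6 7], [8 9 10 11]]

col_sums = np.sum(A, axis=0)      # Collapse the rows: one sum per column.
row_means = np.mean(A, axis=1)    # Collapse the columns: one mean per row.
row_maxes = A.max(axis=1)         # The method form works the same way.
```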
\begin{problem} % Naive Nearest Neighbor with vectorization.
Let $A$ be an $m\times n$ matrix with columns $\a_0, \ldots, \a_{n-1}$, and let $\x$ be a vector of length $m$.
The \emph{nearest neighbor problem}\footnote{The nearest neighbor problem is a common problem in many fields of artificial intelligence. The problem can be solved more efficiently with a $k$-d tree, a specialized data structure for storing high-dimensional data.} is to determine which of the columns of $A$ is ``closest'' to $\x$ with respect to some norm.
That is, we compute
\[\underset{j}{\text{argmin }} \|\a_j - \x\|.\]
The following function solves this problem na\"ively for the usual Euclidean norm.
\begin{lstlisting}
def nearest_column(A, x):
"""Find the index of the column of A that is closest to x."""
distances = []
for j in range(A.shape[1]):
distances.append(np.linalg.norm(A[:,j] - x))
return np.argmin(distances)
\end{lstlisting}
Write a new version of this function without any loops or list comprehensions, using array broadcasting and the \li{axis} keyword in \li{np.linalg.norm()} to eliminate the existing loop.
Try to implement the entire function in a single line.
\\(Hint: See the NumPy Visual Guide in the Appendix for a refresher on array broadcasting.)
Profile the old and new versions with \li{<p<\%prun>p>} and compare the output.
Finally, use \li{<p<\%time>p>} or \li{<p<\%timeit>p>} to verify that your new version runs faster than the original.
\end{problem}
\subsection*{Use Data Structures Correctly} % ---------------------------------
% TODO: set/dict lookup, avoiding list indexing in loops, list comprehensions.
Every data structure has strengths and weaknesses, and choosing the wrong data structure can be costly.
Here we consider three ways to avoid problems and use sets, dictionaries, and lists correctly.
\begin{itemize}
\item \textbf{Membership testing}. The question ``is \li{<value>} a member of \li{<container>}'' is common in numerical algorithms.
Sets and dictionaries are implemented in a way that makes this a trivial problem, but lists are not.
In other words, the \li{in} operator is near instantaneous with sets and dictionaries, but not with lists.
\begin{lstlisting}
<g<In [13]:>g> a_list = list(range(int(1e7)))
<g<In [14]:>g> a_set = set(a_list)
<g<In [15]:>g> <p<%timeit>p> 12.5 in a_list
<<413 ms +- 48.2 ms per loop (mean+-std.dev. of 7 runs, 1 loop each)>>
<g<In [16]:>g> <p<%timeit>p> 12.5 in a_set
<<170 ns +- 3.8 ns per loop (mean+-std.dev. of 7 runs, 10000000 loops each)>>
\end{lstlisting}
Looking up dictionary values is also almost immediate.
Use dictionaries for storing calculations to be reused, such as mappings between letters and numbers or common function outputs.
\item \textbf{Construction with comprehension}.
Lists, sets, and dictionaries can all be constructed with comprehension syntax.
This is slightly faster than building the collection in a loop, and the code is highly readable.
% TODO (?): map().
\begin{lstlisting}
# Map the integers to their squares.
<g<In [17]:>g> <p<%%time>p>
<g<...:>g> a_dict = {}
<g<...:>g> for i in range(1000000):
<g<...:>g> a_dict[i] = i**2
<g<...:>g>
<<CPU times: user 432 ms, sys: 54.4 ms, total: 486 ms
Wall time: 491 ms>>
<g<In [18]:>g> <p<%time>p> a_dict = {i:i**2 for i in range(1000000)}
<<CPU times: user 377 ms, sys: 58.9 ms, total: 436 ms
Wall time: 440 ms>>
\end{lstlisting}
\item \textbf{Intelligent iteration}.
Unlike looking up dictionary values, indexing into lists takes time.
Instead of looping over the indices of a list, loop over the entries themselves.
When indices and entries are both needed, use \li{enumerate()} to get the index and the item simultaneously.
\begin{lstlisting}
<g<In [19]:>g> a_list = list(range(1000000))
<g<In [20]:>g> <p<%%time>p> # Loop over the indices of the list.
<g<...:>g> for i in range(len(a_list)):
<g<...:>g> item = a_list[i]
<g<...:>g>
<<CPU times: user 103 ms, sys: 1.78 ms, total: 105 ms
Wall time: 107 ms>>
<g<In [21]:>g> <p<%%time>p> # Loop over the items in the list.
<g<...:>g> for item in a_list:
<g<...:>g> _ = item
<g<...:>g>
<<CPU times: user 61.2 ms, sys: 1.31 ms, total: 62.5 ms
Wall time: 62.5 ms>> # Almost twice as fast as indexing!
\end{lstlisting}
% <g<In [X]:>g> <p<%%time>p> # Use enumerate() to get both indices and items.
% <g<...:>g> for i, item in enumerate(a_list):
% <g<...:>g> _ = item
% <g<...:>g>
% <<CPU times: user 92.5 ms, sys: 1.58 ms, total: 94.1 ms
% Wall time: 94.4 ms>> # Still slightly faster than indexing.
% \end{lstlisting}
\end{itemize}
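The first two points combine naturally: build a lookup table once with a comprehension, then rely on near-instantaneous dictionary lookups inside a loop. A small sketch of the letter-to-number mapping suggested above, checked against the ``COLIN'' example:

```python
from string import ascii_uppercase

# Build the letter -> number mapping once, with a dict comprehension.
letter_values = {ch: i for i, ch in enumerate(ascii_uppercase, start=1)}

# Each lookup inside the loop is then a fast dictionary access.
word_value = sum(letter_values[ch] for ch in "COLIN")
```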
\begin{comment} % USELESS
Second, swap values with a single assignment.
\begin{lstlisting}
>>> a, b = 1, 2
>>> a, b = b, a
>>> a, b
(2, 1)
\end{lstlisting}
Third, many non-Boolean objects in Python have truth values.
For example, numbers are \li{False} when equal to zero and \li{True} otherwise.
Similarly, lists and strings are \li{False} when they are empty and \li{True} otherwise.
The following code gives some examples.
\begin{lstlisting}
# Use the truth values of numbers.
>>> if 10:
... print("Non-zero")
...
Non-zero
# Use the truth values of a list.
>>> my_list = [i for i in range(5)]
>>> if my_list:
... print(my_list[0])
...
0
\end{lstlisting}
\end{comment}
\begin{problem} % Name scores.
This is problem 22 from \url{https://projecteuler.net}.
Using the rule $A\mapsto 1, B\mapsto 2, \ldots, Z\mapsto 26$, the \emph{alphabetical value} of a name is the sum of the numbers that correspond to its letters.
For example, the alphabetical value of ``COLIN'' is $3 + 15 + 12 + 9 + 14 = 53$.
The following function reads the file \texttt{names.txt}, containing over five-thousand first names, and sorts them in alphabetical order.
The \emph{name score} of each name in the resulting list is the alphabetic value of the name multiplied by the name's position in the list, starting at 1.
``COLIN'' is the 938th name alphabetically, so its name score is $938 \times 53 = 49714$.
The function returns the total of all the name scores in the file.
\begin{lstlisting}
def name_scores(filename="names.txt"):
"""Find the total of the name scores in the given file."""
with open(filename, 'r') as infile:
names = sorted(infile.read().replace('"', '').split(','))
total = 0
for i in range(len(names)):
name_value = 0
for j in range(len(names[i])):
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for k in range(len(alphabet)):
if names[i][j] == alphabet[k]:
letter_value = k + 1
name_value += letter_value
total += (names.index(names[i]) + 1) * name_value
return total
\end{lstlisting}
Rewrite this function---removing repetition, eliminating loops, and using data structures correctly---so that it runs in less than 10 milliseconds on average.
\end{problem}
\subsection*{Use Generators} % ------------------------------------------------
A \emph{generator} is an iterator that yields a sequence of values one at a time, as opposed to computing and returning them all at once.
For example, \li{range()} produces its values lazily, one at a time, much like a generator.
Using generators appropriately can reduce both the run time and the spatial complexity of a routine.
Consider the following function, which constructs a list containing the entries of the sequence $\{x_n\}_{n=1}^N$ where $x_{n} = x_{n-1} + n$ with $x_1 = 1$.
\begin{lstlisting}
>>> def sequence_function(N):
... """Return the first N entries of the sequence x_n = x_{n-1} + n."""
... sequence = []
... x = 0
... for n in range(1, N+1):
... x += n
... sequence.append(x)
... return sequence
...
>>> sequence_function(10)
[1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
\end{lstlisting}
A potential problem with this function is that all of the values in the list are computed before anything is returned.
This can be a big issue if the parameter $N$ is large.
A generator, on the other hand, \emph{yields} one value at a time, indicated by the keyword \li{yield} (instead of \li{return}).
When the generator is asked for the next entry, the code resumes right where it left off.
% The only visible difference between a generator and a function is the use of \li{yield} in place of \li{return}.
% In the following example, note that \li{sequence_generator()} does not keep track of the entire sequence like \li{sequence_function()} does.
\begin{lstlisting}
>>> def sequence_generator(N):
... """Yield the first N entries of the sequence x_n = x_{n-1} + n."""
... x = 0
... for n in range(1, N+1):
... x += n
... yield x # "return" a single value.
...
# Get the entries of the generator one at a time with next().
>>> generated = sequence_generator(10)
>>> next(generated)
1
>>> next(generated)
3
>>> next(generated)
6
# Put each of the generated items in a list, as in sequence_function().
>>> list(sequence_generator(10)) # Or [i for i in sequence_generator(10)].
[1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
# Use the generator in a for loop, like range().
>>> for entry in sequence_generator(10):
... print(entry, end=' ')
...
1 3 6 10 15 21 28 36 45 55
\end{lstlisting}
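The reduction in spatial complexity can be seen directly: a generator object stores only its current state, while the corresponding list stores every entry. A quick check with \li{sys.getsizeof()} (which reports only the container's own size, not the sizes of the entries it references):

```python
import sys

entries = [n * n for n in range(100000)]   # Stores all 100,000 entries.
lazy = (n * n for n in range(100000))      # Stores only its current state.

# The generator object is a small, fixed size; the list is not.
is_smaller = sys.getsizeof(lazy) < sys.getsizeof(entries)
```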
Many generators, like \li{range()} and \li{sequence_generator()}, only yield a finite number of values.
However, generators can also continue yielding indefinitely.
For example, the following generator yields the terms of $\{x_n\}_{n=1}^\infty$ forever.
In this case, using \li{enumerate()} with the generator is helpful for tracking the index $n$ as well as the entry $x_n$.
\begin{lstlisting}
>>> def sequence_generator_forever():
... """Yield the sequence x_n = x_{n-1} + n forever."""
... x = 0
... n = 1
... while True:
... x += n
... n += 1
... yield x # "return" a single value.
...
# Sum the entries of the sequence until the sum exceeds 1000.
>>> total = 0
>>> for i, x in enumerate(sequence_generator_forever()):
... total += x
... if total > 1000:
... print(i) # Print the index where the total exceeds.
... break # Break out of the for loop to stop iterating.
...
17
# Check that 18 terms are required (since i starts at 0 but n starts at 1).
>>> print(sum(sequence_generator(17)), sum(sequence_generator(18)))
969 1140
\end{lstlisting}
\begin{warn} % Use xrange() in Python 2.
In Python 2.7 and earlier, \li{range()} is \textbf{not} a generator.
Instead, it constructs an entire list of values, which is often significantly slower than yielding terms individually as needed.
If you are using an old version of Python, use \li{xrange()}, which behaves like the \li{range()} of Python 3.0 and later.
\end{warn}
\begin{problem} % Fibonacci sequence.
This is problem 25 from \url{https://projecteuler.net}.
The \emph{Fibonacci sequence} is defined by the recurrence relation $F_{n} = F_{n-1} + F_{n-2}$, where $ F_1 = F_2 = 1$.
The 12th term, $F_{12} = 144$, is the first term to contain three digits.
Write a generator that yields the terms of the Fibonacci sequence indefinitely.
Next, write a function that accepts an integer $N$.
Use your generator to find the first term in the Fibonacci sequence that contains $N$ digits.
Return the index of this term.
\\(Hint: a generator can have more than one \li{yield} statement.)
\end{problem}
% See \url{https://docs.python.org/3/tutorial/classes.html#generators} for more about generators.
% and \url{https://wiki.python.org/moin/Generators}
\begin{problem} % Sieve of Eratosthenes.
The function in Problem \ref{prob:profiling-primes-naive} could be turned into a prime number generator that yields primes indefinitely, but it is not the only strategy for yielding primes.
The \emph{Sieve of Eratosthenes}\footnote{See \url{https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes}.} is a faster technique for finding all of the primes below a certain number.
\begin{enumerate}
\item Given a cap $N$, start with all of the integers from $2$ to $N$.
\item Yield the first entry in the list (it is prime) and remove it from the list. \label{step:profiling-sieve-of-eratos}
\item Remove every remaining integer in the list that is divisible by the entry just yielded.
\item Return to step \ref{step:profiling-sieve-of-eratos} until the list is empty.
\end{enumerate}
Write a generator that accepts an integer $N$ and that yields all primes (in order, one at a time) that are less than $N$ using the Sieve of Eratosthenes.
Your generator should be able to find all primes less than 100,000 in under $5$ seconds.
Your generator and your fast function from Problem \ref{prob:profiling-primes-naive} may be helpful in solving problems 10, 35, 37, 41, 49, and 50 (for starters) of \url{https://projecteuler.net}.
\end{problem}
\section*{Numba} % ============================================================
Python code is simpler and more readable than many languages, but Python is also generally much slower than compiled languages like C.
The \li{numba} module\footnote{Numba is \textbf{not} part of the standard library, but it is included in the Anaconda distribution. For installation details, see \url{https://numba.pydata.org/}.} bridges the gap by using \emph{just-in-time} (JIT) compilation to optimize code, meaning that the code is actually compiled right before execution.
\begin{lstlisting}
>>> from numba import jit
>>> @jit # Decorate a function with @jit to use Numba.
... def row_sum_numba(A):
... """Sum the rows of A by iterating through rows and columns,
... optimized by Numba.
... """
... m,n = A.shape
... row_totals = np.empty(m)
... for i in range(m):
... total = 0
... for j in range(n):
... total += A[i,j]
... row_totals[i] = total
... return row_totals
...
\end{lstlisting}
Python is a \emph{dynamically typed} language, meaning variables are not defined explicitly with a datatype (\li{x = 6} as opposed to \li{int x = 6}).
This particular aspect of Python makes it flexible, easy to use, and slow.
% One of the reasons compiled languages like C are so much faster than Python is because they have explicitly defined datatypes.
Numba speeds up Python code primarily by assigning datatypes to all the variables.
Rather than requiring explicit definitions for datatypes, Numba attempts to infer the correct datatypes based on the datatypes of the input.
In \li{row_sum_numba()}, if \li{A} is an array of integers, Numba will infer that \li{total} should also be an integer.
On the other hand, if \li{A} is an array of floats, Numba will infer that \li{total} should be a \emph{double} (the C datatype corresponding to a Python float).
Once all datatypes have been inferred and assigned, the original Python code is translated to machine code. % by the LLVM library.
Numba caches this compiled version of code for later use.
The first function call takes the time to compile and then execute the code, but subsequent calls use the already-compiled code.
\begin{lstlisting}
<g<In [22]:>g> A = np.random.random((10000, 10000))
# The first function call takes a little extra time to compile first.
<g<In [23]:>g> <p<%time>p> rows = row_sum_numba(A)
<<CPU times: user 408 ms, sys: 11.5 ms, total: 420 ms>>
Wall time: 425 ms
# Subsequent calls are consistently faster than the first call.
<g<In [24]:>g> <p<%timeit>p> row_sum_numba(A)
<<138 ms +- 1.96 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)>>
\end{lstlisting}
Note that the only difference between \li{row_sum_numba()} and \li{row_sum_awful()} from a few pages ago is the \li{@jit} decorator, and yet the Numba version is about 99\% faster than the original!
The inference engine within Numba does a good job, but it's not always perfect.
Adding the keyword argument \li{nopython=True} to the \li{@jit} decorator raises an error if Numba is unable to convert each variable to explicit datatypes.
The \li{inspect_types()} method can also be used to check if Numba is using the desired types.
\begin{lstlisting}
# Run the function once first so that it compiles.
>>> rows = row_sum_numba(np.random.random((10,10)))
>>> row_sum_numba.inspect_types()
# The output is very long and detailed.
\end{lstlisting}
Alternatively, datatypes can be specified explicitly in the \li{@jit} decorator as a dictionary via the \li{<<locals>>} keyword argument.
Each of the desired datatypes must also be imported from Numba.
\begin{lstlisting}
>>> from numba import int64, double
>>> @jit(nopython=True, <<locals>>=dict(A=double[:,:], m=int64, n=int64,
... row_totals=double[:], total=double))
... def row_sum_numba(A): # 'A' is a 2-D array of doubles.
... m,n = A.shape # 'm' and 'n' are both integers.
... row_totals = np.empty(m) # 'row_totals' is a 1-D array of doubles.
... for i in range(m):
... total = 0 # 'total' is a double.
... for j in range(n):
... total += A[i,j]
... row_totals[i] = total
... return row_totals
...
\end{lstlisting}
While it sometimes results in a speed boost, there is a caveat to specifying the datatypes: \li{row_sum_numba()} no longer accepts arrays that contain anything other than floats.
When datatypes are not specified, Numba compiles a new version of the function each time the function is called with a different kind of input.
Each compiled version is saved, so the function can still be used flexibly.
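This per-type compilation can be observed directly. The following sketch (which assumes Numba is installed, and uses an arbitrary example function) calls a jitted function with integer and float input and then inspects the dispatcher's \li{signatures} attribute, which lists one entry per compiled version.
\begin{lstlisting}
from numba import jit
import numpy as np

@jit(nopython=True)
def array_sum(A):
    """Sum the entries of a 1-D array with an explicit loop."""
    total = 0.
    for i in range(A.shape[0]):
        total += A[i]
    return total

# Each distinct input type triggers (and caches) a separate compilation.
array_sum(np.arange(10))            # Compiled for integer arrays.
array_sum(np.random.random(10))     # Compiled again for float arrays.
print(array_sum.signatures)         # One signature per compiled version.
\end{lstlisting}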
\begin{problem} % Compare times for Numba.
The following function calculates the $n$th power of an $m\times m$ matrix $A$.
\begin{lstlisting}
def matrix_power(A, n):
"""Compute A^n, the n-th power of the matrix A."""
product = A.copy()
temporary_array = np.empty_like(A[0])
m = A.shape[0]
    for power in range(1, n):
for i in range(m):
for j in range(m):
total = 0
for k in range(m):
total += product[i,k] * A[k,j]
temporary_array[j] = total
product[i] = temporary_array
return product
\end{lstlisting}
% TODO: review time.time() (they might not have seen it in PythonEssentials).
\begin{enumerate}
\item Write a Numba-enhanced version of \li{matrix_power()} called \li{matrix_power_numba()}.
\item Write a function that accepts an integer $n$.
Run \li{matrix_power_numba()} once with a small random input so it compiles.
Then, for $m=2^2,2^3,\ldots,2^7$,
\begin{enumerate}
\item Generate a random $m\times m$ matrix $A$ with \li{np.random.random()}.
\item Time (separately) \li{matrix_power()}, \li{matrix_power_numba()}, and NumPy's \\ \li{np.linalg.matrix_power()} on $A$ with the specified value of $n$.
\\(If you are unfamiliar with timing code inside of a function, see the \\ Additional Material section on timing code.)
\end{enumerate}
Plot the times against the size $m$ on a log-log plot (use \li{plt.loglog()}).
\end{enumerate}
With $n=10$, the plot should show that the Numba and NumPy versions far outperform the pure Python implementation, with NumPy eventually becoming faster than Numba.
% NumPy takes products of matrices by calling BLAS and LAPACK, which are heavily optimized linear algebra libraries written in C, assembly, and Fortran.
\end{problem}
\begin{warn}
Optimizing code is an important skill, but it is also important to know when to refrain from optimization.
The best approach to coding is to write unit tests, implement a solution that works, test and time that solution, \textbf{then} (and only then) optimize the solution with profiling techniques.
As always, the most important part of the process is choosing the correct algorithm to solve the problem.
Don't waste time optimizing a poor algorithm.
\end{warn}
\newpage
\section*{Additional Material} % ==============================================
\subsection*{Other Timing Techniques} % ---------------------------------------
Though \li{<p<\%time>p>} and \li{<p<\%timeit>p>} are convenient and work well, some problems require more control for measuring execution time.
The usual way of timing a code snippet by hand is via the \li{time} module (which \li{<p<\%time>p>} uses).
The function \li{time.time()} returns the number of seconds since the Epoch\footnote{See \url{https://en.wikipedia.org/wiki/Epoch_(reference_date)\#Computing}.}; to time code, measure the number of seconds before the code runs, the number of seconds after the code runs, and take the difference.
\begin{lstlisting}
>>> import time
>>> start = time.time() # Record the current time.
>>> for i in range(int(1e8)):      # Execute some code.
...     pass
...
>>> end = time.time()              # Record the time again.
>>> print(end - start)             # Take the difference.
4.20402193069458                   # (seconds)
\end{lstlisting}
The \li{timeit} module (which \li{<p<\%timeit>p>} uses) has tools for running code snippets several times.
The code is passed in as a string, as well as any setup code to be run before starting the clock.
\begin{lstlisting}
>>> import timeit
>>> timeit.timeit("for i in range(N): pass", setup="N = int(1e6)", number=200)
4.884839255013503 # Total time in seconds to run the code 200 times.
>>> _ / 200
0.024424196275067516 # Average time in seconds.
\end{lstlisting}
The primary advantages of these techniques are the ability to automate timing code and to save the results.
For more documentation, see \url{https://docs.python.org/3.6/library/time.html} and \url{https://docs.python.org/3.6/library/timeit.html}.
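For instance, the pattern above can be wrapped in a small reusable helper that times a function over several trials and returns the results for later analysis. This is only a sketch; the function name and defaults are arbitrary.
\begin{lstlisting}
import time

def time_function(f, *args, trials=5, **kwargs):
    """Call f(*args, **kwargs) several times, timing each call with
    time.perf_counter(), and return the list of durations in seconds.
    """
    results = []
    for _ in range(trials):
        start = time.perf_counter()     # A high-resolution clock.
        f(*args, **kwargs)
        results.append(time.perf_counter() - start)
    return results

times = time_function(sum, range(int(1e6)))
print(min(times))    # The minimum is usually the most stable statistic.
\end{lstlisting}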
\subsection*{Customizing the Profiler} % --------------------------------------
The output from \li{<p<\%prun>p>} is generally long, but it can be customized with the following options.
\begin{table}[H]
\centering
\begin{tabular}{l|l}
Option & Description \\ \hline
\li{-l <limit>} & Include a limited number of lines in the output.\\
\li{-s <key>} & Sort the output by call count, cumulative time, function name, etc. \\
\li{-T <filename>} & Save profile results to a file (results are still printed).\\
\end{tabular}
\end{table}
For example, \li{<p<\%prun>p> -l 3 -s ncalls -T path_profile.txt max_path()} generates a profile of \li{max_path()} that lists the 3 functions with the most calls, then write the results to \texttt{path\_profile.txt}.
See \url{http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-prun} for more details.