chriselrod
Collaborator

The main change is to macrokernels.jl, which adds a macro kernel for convenience.

It also passes types around as Val{T} instead of Type{T}, which probably should've gone in a separate PR.

I am not sure if it should be merged.
It makes compile times a fair bit worse, and from limited benchmarks doesn't really seem to help performance.

@chriselrod
Collaborator Author

gemm_Float64_10_10000_cascadelake_AVX512__multithreaded_logscale

 Row │ Size   Octavian
     │ Int64  Float64
─────┼──────────────────
 269 │  4886   1800.67
 270 │  5000   1807.82
 271 │  5117   1788.49
 272 │  5237   1798.98
 273 │  5359   1728.39
 274 │  5484   1707.73
 275 │  5613   1750.0
 276 │  5744   1727.5
 277 │  5878   1726.07
 278 │  6015   1709.47
 279 │  6156   1758.61
 280 │  6300   1789.8
 281 │  6447   1816.43
 282 │  6598   1810.03
 283 │  6752   1828.7
 284 │  6910   1791.21
 285 │  7071   1741.78
 286 │  7237   1726.9
 287 │  7406   1715.8
 288 │  7579   1728.31
 289 │  7756   1762.05
 290 │  7937   1751.56
 291 │  8123   1754.17
 292 │  8313   1769.72
 293 │  8507   1753.06
 294 │  8706   1785.01
 295 │  8909   1751.57
 296 │  9117   1768.85
 297 │  9330   1781.09
 298 │  9548   1833.02
 299 │  9772   1829.03
 300 │ 10000   1842.31

I'll add benchmarks from the master branch in an hour or so.

@chriselrod
Collaborator Author

Current master:

gemm_Float64_10_10000_cascadelake_AVX512__multithreaded_logscale

 Row │ Size   Octavian
     │ Int64  Float64
─────┼──────────────────
 269 │  4886   1745.31
 270 │  5000   1696.69
 271 │  5117   1779.43
 272 │  5237   1759.75
 273 │  5359   1657.65
 274 │  5484   1696.74
 275 │  5613   1651.58
 276 │  5744   1635.12
 277 │  5878   1616.55
 278 │  6015   1636.0
 279 │  6156   1687.92
 280 │  6300   1769.92
 281 │  6447   1741.46
 282 │  6598   1751.08
 283 │  6752   1783.15
 284 │  6910   1773.49
 285 │  7071   1733.6
 286 │  7237   1670.6
 287 │  7406   1649.37
 288 │  7579   1660.92
 289 │  7756   1742.86
 290 │  7937   1736.86
 291 │  8123   1733.38
 292 │  8313   1737.56
 293 │  8507   1746.48
 294 │  8706   1741.73
 295 │  8909   1690.63
 296 │  9117   1599.14
 297 │  9330   1735.78
 298 │  9548   1632.21
 299 │  9772   1747.52
 300 │ 10000   1806.31

So this does look like an improvement.

CI doesn't like the compile time increase, though.

@chriselrod
Collaborator Author

chriselrod commented May 24, 2021

Full vector of results for max gflops on this PR (top) vs master (bottom):

[40.34582132564841, 49.10519317564656, 57.83779726363916, 56.899033139961325, 64.68719706242351, 75.34026894951074, 89.35570454467579, 65.68873726774008, 73.21357336307155, 71.45129465903896, 80.17237059678308, 85.8050032071841, 93.16683822947185, 90.7559175828576, 108.24178126757612, 74.94998312683796, 77.28818401991094, 80.56081351927256, 81.40020992946764, 82.38276149992409, 87.6656092554581, 92.45198115821557, 96.93934419971752, 79.64347518566545, 85.09949592340742, 84.69135802469137, 95.67575392038603, 85.67103594080336, 88.12655584999598, 89.5651517439227, 107.1309005691329, 86.56242150213512, 90.79968135302408, 86.86441603845732, 99.70037453183521, 92.06870421823692, 171.66843033509699, 166.9582696791831, 188.33787465940054, 150.68715978226064, 160.63740923986379, 160.71117034165255, 171.73496183206106, 170.60333467025725, 156.36941410129097, 166.23646960865946, 240.22433486081667, 205.47320537002108, 208.77641645711842, 215.73424369747897, 231.68507990990025, 222.73662977702668, 228.28352490421454, 238.11480266638455, 334.7516281445537, 281.21959961087504, 290.1883025850951, 281.7451990632319, 302.62945139557263, 296.6965378825891, 295.2230669918233, 305.6456020495303, 372.8751248751249, 322.315581127733, 334.4658840792369, 340.46583572453375, 376.4444020962363, 364.30349780555923, 352.390099009901, 374.09695232474814, 453.54330708661416, 374.92898016775115, 372.95193716883995, 371.0944712611041, 394.33090773005114, 374.4521931328836, 402.08992493085736, 399.8500189753321, 486.669152945844, 400.97774617845715, 402.70680845187127, 406.758518318602, 443.74427467322005, 427.87791741472176, 440.5842920133939, 438.3587908225219, 574.5272129550713, 427.5959440465832, 436.87835283975994, 420.9931508972015, 465.1317319512276, 422.91066350016126, 444.4451358142874, 437.04709529047096, 580.4251805985554, 437.22460027697343, 452.2559331687867, 448.3094751608673, 491.4510874865893, 505.1143470064356, 527.4768824306474, 511.8858426125198, 679.7882076449852, 
513.5017052700258, 531.1621403603119, 539.3333136321996, 559.1436162273501, 554.6723681989534, 523.5573484646984, 570.0774405740768, 616.1106265226938, 614.5884327575354, 608.7226624405706, 1180.2132779248634, 1000.7274872686625, 1039.5332385353715, 990.243095329006, 1033.44340654191, 1041.9890901156448, 1110.9077076139756, 1378.8460243721806, 1179.8700013374348, 1396.1375046006626, 1078.7941747572816, 1148.9709507985854, 1266.1745549283544, 1543.8033951509647, 1364.4090349075977, 1630.2863065760685, 1271.4739730583735, 1326.2120675784392, 1381.8868163136265, 1377.409237536657, 1814.3300027005132, 1182.3058217865162, 1238.006864006864, 1216.40015789214, 1274.3684663986214, 1340.8219489120152, 1762.7403212758582, 1422.1795617270557, 1500.6189967982925, 1436.8894148184909, 1811.5540351982715, 1434.4339698224064, 1464.670990192977, 1483.6032535828142, 1271.1574801258496, 1205.166188807476, 1263.4390309223131, 1700.7318212487671, 1282.2539513733543, 1343.5076653682593, 1392.3679180180804, 1355.0662279670973, 1351.1550805262311, 1404.8845530765952, 1850.529181389358, 1425.5430098797196, 1516.5253527063549, 1535.7374614310686, 1534.3753743541363, 1508.3534476702762, 1846.7677154081387, 1412.25448122465, 1274.4416950158366, 1259.481971207228, 1301.7372236007382, 1317.7718932467787, 1677.7200681955671, 1436.729211531401, 1705.5074081037317, 1448.3545645618206, 1728.6501020079459, 1425.3967938433875, 1498.6647460589777, 1381.8697597221692, 1410.1036025289275, 1459.9232017306651, 1484.1826499285692, 1625.0189843011449, 1282.7791586496214, 1399.3890877959468, 1421.2363210527128, 1327.8257755601398, 1537.3181157433942, 1446.0510550428646, 1894.4646630240852, 1536.9515011547344, 1983.239557473163, 1881.277700529956, 1911.6987430462405, 1976.590361841889, 2053.2189763710594, 1666.609418369427, 1676.8946626607965, 1721.996191765017, 1775.6099529197468, 1809.8680087888379, 1828.7828577744554, 1793.6098315966253, 1824.0356457219523, 1868.1032043343773, 1897.8684709417257, 
1952.1828836741668, 2002.066195275048, 1863.5254631773832, 1866.7339836488952, 1850.5189626183196, 1860.6231879018646, 1807.0010117869076, 1648.893865094042, 1680.3100408152434, 1671.7127358580833, 1509.4748508112873, 1550.9174850285601, 1526.240797396785, 1589.8528524683322, 1614.8395136547858, 1655.4706383714376, 1623.9052077471797, 1620.367948917266, 1659.6982510286646, 1703.0939708897163, 1618.6605744920648, 1603.556403845013, 1654.2378101769687, 1506.0267417976777, 1542.6900379974843, 1553.4188496768918, 1577.3472114276676, 1621.5751060066825, 1652.1588400714031, 1593.8797653807178, 1656.2655618263411, 1691.0596295545129, 1696.1842096071011, 1692.4884194505992, 1705.9797505739668, 1715.043893049753, 1707.6829577090325, 1685.573877471744, 1745.0100110726469, 1610.1383028338794, 1633.749395415954, 1676.0898043365132, 1676.336815881731, 1739.8883922572516, 1708.9476454393873, 1728.2104501636875, 1788.5291291318058, 1747.6995350634618, 1764.619618815973, 1716.8947774098972, 1772.4279526483185, 1652.0787144766114, 1669.9998629171864, 1690.1282120776746, 1705.003374425393, 1713.0585039179089, 1720.5239107606383, 1750.8437904700693, 1729.562469706851, 1691.0325637976673, 1716.0675686603429, 1730.367186300895, 1759.4191262225352, 1781.0623973370289, 1800.671893693608, 1807.8240382674405, 1788.4893566798794, 1798.984050985084, 1728.394416415473, 1707.7266490627967, 1749.9953528934366, 1727.5007540837853, 1726.0688046616704, 1709.4721470318502, 1758.60524619606, 1789.7958445466677, 1816.431788944306, 1810.031155346502, 1828.6982237369236, 1791.2057693135548, 1741.7784915908455, 1726.8981546550692, 1715.797318378928, 1728.3130407139568, 1762.0525930469937, 1751.5552578843751, 1754.1696481870924, 1769.715831204492, 1753.060144421134, 1785.0073453280284, 1751.5693681138878, 1768.8478700057735, 1781.0860222393733, 1833.0227057421198, 1829.0339107782168, 1842.3120988099997]
[41.197564840296884, 47.06509747551882, 57.52872374688523, 54.11964809384165, 63.74684829511803, 73.57137150618017, 88.23251762097094, 63.30346550856997, 69.59713226732019, 68.47986155599743, 80.23700369076975, 86.52598702650756, 91.47997473152243, 90.12804533910264, 110.29120580235721, 74.45910511380644, 76.35200421246627, 82.36180228648284, 83.94147307939969, 81.97983193277312, 90.47184285082312, 94.9393085111398, 96.74003608331967, 79.56531365313654, 84.88097809049427, 84.25152642290686, 97.47813053549531, 84.28119800332777, 86.89152810768013, 89.22833935018052, 109.33629452464336, 89.27012499190467, 93.3686200378072, 96.67092224451333, 101.18066278655422, 94.62125538653237, 172.23038131469522, 165.6794063671906, 188.80409731113957, 150.8513912040005, 161.14477246358126, 160.66250832677287, 173.7295360474455, 164.26900584795322, 149.79927065165688, 164.763974471831, 239.83065892796176, 213.03692626251006, 217.85618579723092, 217.26330265524172, 241.57020634121793, 223.74906900328585, 232.590761223162, 239.48313291476006, 350.3427998663548, 279.56396335256187, 286.9698885376809, 282.99707266074233, 299.28486066310614, 304.97508896797154, 311.1895161290323, 306.86853386681906, 365.78994936571024, 297.5652728199898, 324.72762888433795, 306.6571093970844, 344.67035986913845, 326.62641599427644, 340.5196731114212, 346.60035149384885, 449.75843053047674, 325.6378676470589, 354.19311840044963, 341.35263609566806, 368.611342169705, 341.79769027410595, 385.6053349499848, 375.9517573595005, 483.67556484365764, 353.1529681182238, 376.5131217921818, 416.351945854484, 465.82696477978016, 440.9398704902868, 451.6651599089148, 453.9077493216862, 591.1787847149718, 433.1329491525423, 447.0007463192889, 430.13793103448273, 463.1773969430292, 419.94334459066033, 460.6651576695296, 439.94415207200996, 593.4000659413123, 431.7885117493472, 450.0060453400504, 451.50391596793514, 495.1049390803091, 516.4622133599203, 537.8398660740057, 543.2856197033897, 706.2221105166781, 
532.4670643951042, 550.3677057858404, 536.4165417511683, 581.1769524341432, 590.34624827834, 538.7795383352088, 568.8867671192004, 637.2917055565925, 638.2221576673082, 637.6358030965207, 1206.3710605645383, 990.3029222874354, 999.5580242693555, 1052.0785418486068, 1060.3760097448392, 1053.8311817279048, 1114.233388119277, 1385.5625465124144, 1186.814381327145, 1406.073689673067, 1088.066804482646, 1126.6291780533952, 1234.4163403534765, 1536.4419780490814, 1326.9439840239643, 1634.867878041269, 1336.9542712249715, 1324.8264887888774, 1359.324521847302, 1352.131126304426, 1822.3681735985533, 1193.9451357778883, 1241.6022372808432, 1239.082328106152, 1288.8036595991869, 1373.8355951919348, 1785.9600725952812, 1414.5070349589987, 1502.414839509339, 1486.9520837448156, 1834.3419169591175, 1425.6296738661626, 1444.0675587161973, 1432.8846939656414, 1391.173189643843, 1232.8419657599723, 1288.3617074912818, 1726.7673174716094, 1312.3323449932443, 1361.4113706319029, 1398.0601094789358, 1353.560504569926, 1355.6086548885178, 1392.8306400484653, 1899.9026412666644, 1438.0441493315861, 1537.55034628389, 1432.5605629486956, 1514.2096112416236, 1508.5351131630453, 1919.127054594794, 1255.3646916864745, 1290.8671396120915, 1282.7425606296567, 1343.0747228633045, 1352.332535176207, 1726.5135653293526, 1460.205414376333, 1719.2499801307092, 1431.8200819711105, 1767.0826797798131, 1438.9452678734992, 1499.2531785449733, 1471.0423372728355, 1446.1927702654482, 1480.5527006721256, 1512.591733243291, 1624.0525110972749, 1557.8120279286097, 1438.8749751825724, 1460.9623749830296, 1324.6278063993657, 1559.0355993597843, 1554.360889010677, 1972.8378479938021, 1580.0707432890786, 1896.2289718086888, 1752.9133647911913, 1764.693057233862, 1778.1117361394543, 2051.76242442751, 1662.634324092426, 1688.159421593669, 1714.3281030553676, 1770.9373437126496, 1786.9480600848958, 1829.7477898789434, 1721.119801363574, 1757.94742876309, 1820.871418119373, 1802.443499466438, 1828.8944489680907, 
1847.9822479228749, 1622.464103546999, 1605.7200598816519, 1586.1483108212356, 1476.1265230791555, 1431.1423292719423, 1423.6637388724039, 1440.8753148784306, 1516.202504305341, 1486.975507808609, 1545.9871410067187, 1502.100632211207, 1560.6060656381044, 1595.750893440515, 1625.2396842937449, 1612.1541710709962, 1620.4643951456683, 1673.7117284056949, 1675.291435844321, 1612.6254199198218, 1603.884414958813, 1649.3346203244282, 1502.3585454989409, 1522.8380341702248, 1524.382633878463, 1576.86892432821, 1616.6691458775783, 1610.2294854474687, 1589.946782996978, 1651.1534967603236, 1678.7365828640532, 1697.0499208859226, 1729.203170817776, 1714.3287836968639, 1711.903652199903, 1709.4336696830592, 1596.6552860700224, 1692.8925858238686, 1601.4781962248871, 1611.1468119512679, 1685.1565627153964, 1672.3507638824317, 1714.5495219556565, 1683.337783580443, 1733.5601753445228, 1769.7243994742817, 1741.6116681735366, 1740.238607295019, 1668.580615658366, 1730.5267070927782, 1644.754536855583, 1632.3595471457961, 1656.8051498942423, 1705.8379493216528, 1721.541056089482, 1721.8175073525333, 1726.9177688042128, 1696.1839698203942, 1666.9182534252132, 1590.6112184953847, 1600.4044074046271, 1725.5059412008961, 1755.179633003724, 1745.3052987118758, 1696.6948858697826, 1779.4282791133296, 1759.7541736422993, 1657.6515500160326, 1696.742053268825, 1651.5758122179027, 1635.1223112830924, 1616.5458194130292, 1636.0013653463366, 1687.9191748889323, 1769.9180002123007, 1741.4637015129506, 1751.0792423190032, 1783.1455649030156, 1773.48881743802, 1733.5962208525084, 1670.6004761652978, 1649.3732137075401, 1660.9188544064284, 1742.8593385914367, 1736.856510067945, 1733.3772188736275, 1737.558047945733, 1746.4789194619843, 1741.7324508484026, 1690.6337063771164, 1599.1374048613611, 1735.7815797082878, 1632.2066478631368, 1747.5176477354014, 1806.30705869422]

Focusing on the 151 largest sizes, i.e. indices 150:end (as this is when packing starts to matter), on this PR:

julia> rb.sizes[150] # size range: 313:10_000
313

julia> hp = rb.gflops[150:end,1,BLASBenchmarksCPU.get_measure_index(:minimum)];

julia> using StatsBase, Statistics

julia> summarystats(hp)
Summary Stats:
Length:         151
Missing Count:  0
Mean:           1666.606157
Minimum:        1259.481971
1st Quartile:   1565.383031
Median:         1705.979751
3rd Quartile:   1774.018953
Maximum:        2053.218976

On master:

julia> hp = rb.gflops[150:end,1,BLASBenchmarksCPU.get_measure_index(:minimum)];

julia> summarystats(hp)
Summary Stats:
Length:         151
Missing Count:  0
Mean:           1631.788354
Minimum:        1255.364692
1st Quartile:   1541.768744
Median:         1657.651550
3rd Quartile:   1733.578198
Maximum:        2051.762424
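
As a rough check on the two summaries above, the relative gain can be computed directly from the reported means and medians. This is only a sketch: `rel_gain` is a helper name introduced here for illustration, and the numbers are copied from the `summarystats` output above.

```julia
# Relative gain of `new` over `old`, as a fraction.
rel_gain(new, old) = new / old - 1

# Values copied from the summarystats output (GFLOPS, sizes 313:10_000).
pr     = (mean = 1666.606157, median = 1705.979751)
master = (mean = 1631.788354, median = 1657.651550)

mean_gain   = rel_gain(pr.mean, master.mean)
median_gain = rel_gain(pr.median, master.median)

println("mean gain:   ", round(100 * mean_gain; digits = 2), "%")
println("median gain: ", round(100 * median_gain; digits = 2), "%")
```

On these numbers the gain is roughly 2% on the mean and roughly 3% on the median, from a single run of each branch.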

@codecov

codecov bot commented May 24, 2021

Codecov Report

Merging #91 (95ed1fe) into master (585e3e4) will decrease coverage by 1.26%.
The diff coverage is 79.86%.

@@            Coverage Diff             @@
##           master      #91      +/-   ##
==========================================
- Coverage   87.86%   86.60%   -1.27%     
==========================================
  Files          11       11              
  Lines         651      724      +73     
==========================================
+ Hits          572      627      +55     
- Misses         79       97      +18     
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/types.jl | 100.00% <ø> (ø) | |
| src/memory_buffer.jl | 68.42% <50.00%> (ø) | |
| src/utils.jl | 67.64% <62.00%> (-22.98%) | ⬇️ |
| src/macrokernels.jl | 88.23% <87.09%> (+13.23%) | ⬆️ |
| src/matmul.jl | 90.74% <91.66%> (ø) | |
| src/block_sizes.jl | 96.66% <100.00%> (+1.75%) | ⬆️ |
| src/funcptrs.jl | 97.05% <100.00%> (ø) | |
| src/global_constants.jl | 60.00% <100.00%> (ø) | |

... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@DilumAluthge
Member

The non-coverage CI jobs seem to complete quickly. It's just the coverage jobs that are really slow. Maybe decrease the size of the matrices in the coverage jobs?

@DilumAluthge
Member

And maybe also decrease the number of different matrices that we test in the coverage jobs?

@chriselrod
Collaborator Author

I think lots of different sizes is fine. For example:

 ┌ Info: 
│   T = Int64
│   n = 10
│   k = 10
└   m = 10
140.297737 seconds (13.23 M allocations: 730.991 MiB, 0.26% gc time, 100.00% compilation time)
 78.725758 seconds (11.10 M allocations: 568.837 MiB, 0.34% gc time, 100.00% compilation time)
493.027681 seconds (76.36 M allocations: 3.424 GiB, 0.33% gc time, 100.00% compilation time)
629.243766 seconds (48.56 M allocations: 2.216 GiB, 0.13% gc time, 100.00% compilation time)
  0.100986 seconds (124.12 k allocations: 6.720 MiB, 99.93% compilation time)
  0.092344 seconds (95.78 k allocations: 5.109 MiB, 99.92% compilation time)
  0.102784 seconds (114.68 k allocations: 6.104 MiB, 99.93% compilation time)
  0.095981 seconds (97.39 k allocations: 5.169 MiB, 99.92% compilation time)
┌ Info: 
│   T = Int64
│   n = 10
│   k = 10
└   m = 20
┌ Info: 
  0.000041 seconds (1 allocation: 1.766 KiB)
  0.000035 seconds (1 allocation: 1.766 KiB)
  0.000026 seconds (1 allocation: 1.766 KiB)
  0.000038 seconds (1 allocation: 1.766 KiB)
  0.000018 seconds (1 allocation: 1.766 KiB)
  0.000034 seconds (1 allocation: 1.766 KiB)
  0.000019 seconds (1 allocation: 1.766 KiB)
  0.000026 seconds (1 allocation: 1.766 KiB)

It's the initial compilations that take an eternity.
So perhaps, for the coverage test, we only do 1 case, e.g. A' * B'. The adjoints should encourage it to pack at smaller sizes, so that we hopefully don't lose on coverage % (while still testing all the cases when not taking coverage)?
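
A minimal sketch of that idea, with loud assumptions: the names `coverage_enabled`, `all_cases`, and `cases` are hypothetical, Base's `*` stands in for the matmul under test, and checking `Base.JLOptions().code_coverage` is just one way to detect a coverage run. It is not the test code in this PR.

```julia
# Hypothetical test-selection sketch; Base `*` stands in for the matmul under test.
coverage_enabled() = Base.JLOptions().code_coverage != 0

# All four combinations of plain and adjoint inputs.
all_cases = [(identity, identity), (adjoint, identity),
             (identity, adjoint), (adjoint, adjoint)]

# Under coverage, only run A' * B'; otherwise run every combination.
cases = coverage_enabled() ? [(adjoint, adjoint)] : all_cases

for (fa, fb) in cases
    A, B = rand(10, 10), rand(10, 10)
    C = fa(A) * fb(B)
    # Compare against the same product on materialized copies.
    @assert C ≈ Matrix(fa(A)) * Matrix(fb(B))
end
```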

@MasonProtter
Member

I’m away for a few days so I can’t review this right now, hopefully on Wednesday or Thursday. Feel free to merge without me if you’re in any hurry.

@chriselrod
Collaborator Author

chriselrod commented May 24, 2021

No rush here.
Worth doing more tests to see if it's worth the compile time.

Haswell (AVX2), master vs this (TileMajor) branch, sizes = logspace(100, 2000, 100):
[image: benchmark plot, master vs TileMajor branch]
This is a sizeable improvement. The two runs on the master branch:

julia> using BLASBenchmarksCPU

julia> rb = runbench(sizes = logspace(100, 2000, 100), libs = [:Octavian]); plot(rb, displayplot=false);
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:20:41
  Size:      (454, 454, 454)
  Octavian:  (MedianGFLOPS = 40.86, MaxGFLOPS = 41.51)

julia> hp = rb.gflops[:,1,BLASBenchmarksCPU.get_measure_index(:minimum)];

julia> using StatsBase

julia> println(hp)
[42.37647258242224, 35.16418342719228, 37.85569893840188, 37.99437648973067, 35.414598826792336, 40.88202092691295, 42.95141866447933, 44.51504751231584, 39.723519373230424, 41.45207298072225, 42.20957462321687, 43.32831582853363, 44.831901987868605, 42.8457085459015, 43.40069192411858, 41.82221285609303, 42.05249232199643, 43.10170973787103, 45.53156192257273, 42.137858122167806, 43.26286028929034, 40.51336689809953, 40.65206128339871, 43.78083705784357, 40.70295162311738, 39.73061078494253, 43.92594303818356, 38.64710737022547, 41.339593968351494, 46.41472754346764, 44.010970287602156, 46.27648582375056, 43.33680428997853, 44.207221304748124, 44.79812579269643, 47.2699554764025, 43.99761690035822, 43.58239857750304, 45.82918823454229, 43.1490943600186, 44.94589390866373, 44.37015498374751, 44.63075968118071, 44.14869266597895, 41.84497740568536, 41.90424520754501, 41.67572514147752, 42.44752437261904, 41.835510203971964, 42.64967302034176, 41.510444964538756, 42.48728561436232, 41.27185072718585, 40.3471853326552, 40.62923650281756, 38.5291935337586, 39.327331400220345, 35.457896197601954, 37.033878998139585, 38.886809704953684, 37.865870641427286, 38.3052633462779, 37.82841842760876, 37.77468990631525, 38.58001782327715, 37.94277275081719, 37.97943526214185, 38.20715236549502, 38.52666248159736, 38.778732472220646, 39.605602092033784, 37.9997894209314, 38.040672966430414, 37.55407015883109, 38.270169261747206, 38.538421255907046, 37.2394004980437, 37.921141786042384, 37.465292278101394, 38.075449655027406, 37.772486677390056, 38.593209261538476, 38.37067290154179, 38.47220509194456, 36.741403955331776, 37.60339163168495, 37.01897005269705, 37.217841562648516, 36.76074944636935, 37.51168253198273, 37.83469357085086, 36.94479959715498, 37.29216669328569, 37.65485451725404, 37.30241833164937, 39.16134318345523, 37.939201468337984, 38.348214747341125, 38.81275862703746, 38.67568498554023]

julia> summarystats(hp)
Summary Stats:
Length:         100
Missing Count:  0
Mean:           40.394814
Minimum:        35.164183
1st Quartile:   37.941880
Median:         39.664561
3rd Quartile:   42.872136
Maximum:        47.269955

julia> println(hp)
[42.86969755428375, 35.455126541207015, 37.713652412089736, 38.06827119710555, 35.43025168815224, 40.17646907415511, 43.00004976857612, 44.7590586302013, 44.7171969655624, 41.76589380596738, 42.19799161314112, 43.44045096485127, 48.734845764648284, 43.25359413463911, 43.561957235641835, 45.12152834456545, 42.64149883655622, 43.58043613532265, 45.84104790904668, 42.546986688393744, 43.4127210132608, 46.32565272583799, 41.90994438289887, 41.97560736069472, 46.19609691565713, 39.860733230487476, 41.5762912058413, 40.276119064483375, 40.62040728508189, 42.96296526507659, 41.59415341823239, 41.06315915715995, 42.40297100207802, 41.354191362635696, 43.59660037078559, 44.86680471622365, 43.521213168294956, 43.8451954490853, 44.37604428815427, 41.81479164852218, 42.37375772415911, 42.22722022580727, 43.350611566436385, 40.719983129057354, 40.4168785944354, 41.518749545049864, 40.648238868315644, 40.322824055346906, 40.95377776187999, 41.34874656272526, 40.762183298278266, 42.07416723482104, 39.80731553417903, 38.604234459059164, 37.789315229517946, 39.67664486172888, 39.57061377285065, 37.113196749456336, 38.58319014534991, 39.211551761372455, 38.41774273989768, 38.367298202356395, 38.428357551654244, 38.030873604929894, 38.548285606307765, 37.94324932534827, 38.3383974827697, 38.39594013069244, 38.5611875866133, 38.865200837354756, 39.8181065971093, 38.168008774807014, 38.42472619541092, 37.81404724042411, 38.560566226498096, 38.89402823187056, 37.66463617616825, 38.68052839993805, 38.39588547752412, 39.33193729299354, 38.51278780254479, 39.780079235168486, 39.36258682868399, 39.6607697449674, 37.891333841145055, 38.295777363233725, 38.34134832374149, 38.20565042633544, 37.29925554781007, 37.74107368613238, 37.71857389317153, 37.123090982794096, 37.69873237754014, 38.1351988743486, 37.63568385842445, 39.774978991085426, 38.62631128660234, 38.946841261264744, 40.14377290597999, 40.08508118609371]

julia> summarystats(hp)
Summary Stats:
Length:         100
Missing Count:  0
Mean:           40.461288
Minimum:        35.430252
1st Quartile:   38.395926
Median:         39.839420
3rd Quartile:   42.263855
Maximum:        48.734846

vs this branch:

julia> println(hp)
[42.80455440458865, 35.34960533126294, 37.59876250907598, 38.00034260263804, 35.32875471328534, 43.04851209354919, 45.48086539979997, 46.05816987148516, 44.72500791493357, 43.123880224818244, 43.93095320995259, 44.25106070966626, 49.10471397913121, 45.583604598024394, 44.9103379958495, 45.011055346519115, 44.743977520285426, 45.72954795381352, 48.155507395875716, 45.13733477928554, 45.93947685030753, 46.10215683395758, 45.140949893614106, 45.84193605749011, 46.238240706469895, 44.845903069594335, 47.35464618545606, 45.16789892061206, 45.398494054818414, 47.54714644892569, 45.443010746281104, 47.56201664663325, 44.69868138321463, 45.12677777639215, 46.3079376597953, 48.160595317179585, 44.97507819715641, 44.77041203768809, 47.03003457072018, 44.337490926656386, 45.40106693556422, 45.176957663373784, 45.69699695137391, 44.64843516425153, 42.77033193790125, 43.23353206128012, 42.98686866882247, 42.696863299725656, 42.16607004797231, 43.85851325180056, 42.33991047581385, 43.91372104142068, 42.277837996611886, 41.31861519913892, 41.59416890828179, 42.458096059606, 41.23080004266778, 39.95500293849217, 40.23405526968252, 40.65087409213754, 39.77706064580542, 40.43808115480828, 40.109724543811346, 39.469699985122205, 40.53585585790983, 41.065900374661894, 40.95652776211147, 39.6994287717469, 40.42794084071988, 40.91448728920025, 41.70213415475231, 40.948693190990916, 40.748289070288884, 40.87266889259652, 40.26759294799651, 39.088703517160006, 39.647651707700106, 41.024139947028345, 39.763321417210705, 41.409094878315514, 40.41653945085005, 41.619999921074054, 41.27011025485605, 41.61247842612918, 41.06412085180452, 39.97633603029121, 40.12840600625785, 40.263057818890054, 41.02314463714022, 40.55953774045865, 39.43351983844263, 40.12090515057289, 40.29688911811543, 40.92445177610636, 39.745189469108155, 42.54329994931017, 41.67113786776448, 41.14490615021232, 42.2256855179169, 43.7523330931622]

julia> summarystats(hp)
Summary Stats:
Length:         100
Missing Count:  0
Mean:           42.693332
Minimum:        35.328755
1st Quartile:   40.553617
Median:         42.308874
3rd Quartile:   45.039986
Maximum:        49.104714

So it looks like >5% improvement.
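
A quick sanity check of that figure, using only the means reported by `summarystats` above (two master runs, one run on this branch). The variable names are introduced here for illustration.

```julia
# Means copied from the summarystats output above (GFLOPS on the Haswell laptop).
master_means = (40.394814, 40.461288)   # the two master runs
branch_mean  = 42.693332                # this branch

# Average the two master runs, then compare.
master_avg = sum(master_means) / length(master_means)
gain = branch_mean / master_avg - 1

println("relative gain: ", round(100 * gain; digits = 1), "%")
```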

It's expected that this layout will help older CPUs / those with less sophisticated prefetchers more.

@chriselrod
Collaborator Author

chriselrod commented May 24, 2021

For comparison, this branch vs MKL on the 2-core Haswell laptop from 100 to 4000:
[image: benchmark plot, this branch vs MKL]

julia> hpoct = rb.gflops[:,1,BLASBenchmarksCPU.get_measure_index(:minimum)];

julia> hpmkl = rb.gflops[:,2,BLASBenchmarksCPU.get_measure_index(:minimum)];

julia> using StatsBase

julia> summarystats(hpoct)
Summary Stats:
Length:         200
Missing Count:  0
Mean:           42.762651
Minimum:        36.700316
1st Quartile:   40.929725
Median:         42.120759
3rd Quartile:   44.942970
Maximum:        48.519038


julia> summarystats(hpmkl)
Summary Stats:
Length:         200
Missing Count:  0
Mean:           43.968500
Minimum:        36.772918
1st Quartile:   41.810847
Median:         43.966889
3rd Quartile:   46.558055
Maximum:        49.509909

Octavian's performance for large matrices is still fairly bad on this laptop, and it lags a good bit behind MKL on my desktop as well.

The minimum time above hovered around 30 GFLOPS, although we had 35 GFLOPS for that same range earlier on both this branch and master. =/
I'm not convinced that I'm benchmarking anything other than noise here.
Unfortunately, it's hard to benchmark the two Octavian implementations side by side.

Eventually, I'll have to really try to dig in and find out the cause.
Possibly something in syncmul!.

@chriselrod
Collaborator Author

This PR, though, is about using a tile-major data layout for Apack instead of a column-major one.

We're all familiar with column major.
For reference, "tile major" means that Apack's data is organized in tiles.
On AVX2, you can convert a 200x200 matrix to be tile major via:

A = rand(200,200);
Atm = permutedims(reshape(A, (8, cld(size(A,1), 8), size(A,2))), (1,3,2));

Basically,

A[1:8,:]
A[9:16,:]
A[17:24,:]

etc. We order A's data into tiles from the larger A.
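
To make the layout concrete, here is a small check that the third axis of `Atm` indexes the tiles, so that `Atm[:, :, t]` is the `t`-th block of 8 rows of `A`. It reuses the conversion above; `mr` and `tiles_match` are names introduced for this sketch, and it relies on 200 being a multiple of 8 (otherwise the `reshape` would fail).

```julia
A = rand(200, 200)
mr = 8  # tile height; 8 rows matches the AVX2 example above

# Same conversion as above: split the rows into groups of mr, then move that group index last.
Atm = permutedims(reshape(A, (mr, cld(size(A, 1), mr), size(A, 2))), (1, 3, 2))

# Atm[:, :, t] holds rows (1:mr) .+ mr*(t-1) of A, stored contiguously.
tiles_match = all(Atm[:, :, t] == A[(1:mr) .+ mr * (t - 1), :] for t in axes(Atm, 3))

println(size(Atm), " ", tiles_match)
```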

What's the advantage of this?

With AVX2, the microkernel is 8x6.
This means the innermost level of the computation calculates an 8x6 block of C.

C[(1:8) .+ m, (1:6) .+ n] = Apack[(1:8) .+ m, :] * Bpack[:, (1:6) .+ n]

In other words, we calculate this 8x6 block of C by multiplying an 8 x Kc block of Apack with a Kc x 6 block of Bpack.
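
For illustration only, a naive reference version of that micro-tile computation. It is not the generated LoopVectorization kernel; `microtile!` and `Kc = 32` are hypothetical choices made for this sketch.

```julia
# Naive reference for one micro-tile: C (mr x nr) = Atile (mr x Kc) * Bpanel (Kc x nr).
function microtile!(C, Atile, Bpanel)
    mr, Kc = size(Atile)
    nr = size(Bpanel, 2)
    fill!(C, zero(eltype(C)))
    for k in 1:Kc                      # one rank-1 update of the tile per k
        for n in 1:nr, m in 1:mr
            C[m, n] += Atile[m, k] * Bpanel[k, n]
        end
    end
    return C
end

Kc     = 32             # hypothetical depth of the packed block
Atile  = rand(8, Kc)    # one 8 x Kc tile of Apack
Bpanel = rand(Kc, 6)    # one Kc x 6 panel of Bpack
C = microtile!(zeros(8, 6), Atile, Bpanel)
```

Each step of the `k` loop reads one column of the tile of Apack, which is why keeping those columns adjacent in memory matters.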

If we make all data in the 8 x Kc block of Apack be contiguous, that'd theoretically be friendlier to the hardware prefetchers.
There are streaming prefetchers that look for sequential loads from consecutive addresses.
It also has the advantage of placing all memory close together, which would be friendlier to the strided prefetchers that look at a specific load instruction and monitor whether it's accessing data across a specific stride. These prefetchers want a small stride, and also will not prefetch across pages. A page is only 4 KiB, while the L2 cache is 256 KiB on Haswell (and 1_280 KiB on Tiger Lake), so there are many such page boundaries. Decreasing the stride helps.
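
Some back-of-the-envelope arithmetic for the stride argument. The page and L2 sizes are the ones quoted above; `Mc = 72` is a purely hypothetical packed-block height chosen for illustration, not a value taken from Octavian.

```julia
page_bytes = 4 * 1024          # 4 KiB page
l2_bytes   = 256 * 1024        # Haswell L2
elt        = sizeof(Float64)   # 8 bytes per element
mr         = 8                 # AVX2 tile height from this thread
Mc         = 72                # hypothetical packed-block height, for illustration only

tile_major_stride = mr * elt   # bytes between consecutive k within one tile
col_major_stride  = Mc * elt   # bytes between consecutive k in a column-major block

pages_in_l2          = l2_bytes ÷ page_bytes
k_per_page_tile      = page_bytes ÷ tile_major_stride
k_per_page_col_major = page_bytes ÷ col_major_stride

println((; tile_major_stride, col_major_stride, pages_in_l2,
           k_per_page_tile, k_per_page_col_major))
```

Under these assumptions a tile-major tile advances exactly one 64-byte cache line per `k` and fits 64 values of `k` in a page, while the column-major block crosses a page boundary every 7 values of `k`.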

Still, LoopVectorization is emitting software prefetch instructions for Apack here, because software prefetches still helped performance on my Cascadelake desktop.

What's the advantage of column-major?
It's not much worse, and lets us re-use the same kernel as for small matrix operations before packing becomes profitable. Compiling less means less time compiling.

matmul_params(::Val{T}) where {T <: Base.HWReal} = LoopVectorization.matmul_params()

- function block_sizes(::Type{T}, _α, _β, R₁, R₂) where {T}
+ function block_sizes(::Val{T}, _α, _β, R₁, R₂) where {T}
Member

Why do we need to wrap the type in a Val?

Member

To throw an error if not specialized?

Collaborator Author

I don't think we need to, but I think I prefer it as a style.
It should force specialization, just like Array{T} always specializes on T.
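
A small illustration of the two calling styles. The function names `elbytes` and `only_val` are hypothetical and are not Octavian's API; the point is only that a `Val{T}` argument carries `T` in its own type, and that a method accepting only `Val{T}` rejects a bare type.

```julia
# Hypothetical helpers; not Octavian's actual functions.
elbytes(::Type{T}) where {T} = sizeof(T)   # type passed as Type{T}
elbytes(::Val{T}) where {T} = sizeof(T)    # type passed as Val{T}

from_type = elbytes(Float64)        # dispatches on Type{Float64}
from_val  = elbytes(Val(Float64))   # dispatches on Val{Float64}

# A method that only accepts Val{T} raises a MethodError for a bare type.
only_val(::Val{T}) where {T} = sizeof(T)
threw = try
    only_val(Float64)
    false
catch err
    err isa MethodError
end

println((from_type, from_val, threw))
```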

end
end

if !pack
Member

Are you sure it's if !pack?

Collaborator Author

Yes.
pack means it is packing Apack, and !pack means it is using the already-packed Apack.
When pack, it only evaluates a single n_r tile of Bpack, while with !pack, it needs to evaluate all of those remaining.

Member

Ah, I see.

Collaborator Author

I do it in this awkward way because I haven't gotten around to adding "tile major" memory layout support to LV, and think this should wait until after the rewrite.

So the code is written to manually iterate over m_r and n_r.
The difference between tile major and an m_r x K x (M / m_r) array is the remainders. The last iteration along the (M/m_r) axis will only be partially filled, i.e. we won't have the full m_r iterations on the first axis.
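
A sketch of the remainder case, assuming hypothetical sizes `M = 100` and `K = 5`. The zero-padded array `Apad` is only an illustration of a partially filled last tile; it is not necessarily how the packed buffer is stored in this PR.

```julia
M, mr, K = 100, 8, 5               # hypothetical sizes; M is not a multiple of mr
ntiles    = cld(M, mr)             # number of tiles along the M axis
last_rows = M - mr * (ntiles - 1)  # rows actually used in the final tile

# Every tile has room for mr rows, but the last one is only partly filled.
A = rand(M, K)
Apad = zeros(mr, K, ntiles)
for t in 1:ntiles
    rows = (mr * (t - 1) + 1):min(mr * t, M)
    Apad[1:length(rows), :, t] .= A[rows, :]
end

println((ntiles, last_rows))
```

Any loop over the first axis therefore has to stop at `last_rows` rather than `mr` on the final tile.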

@chriselrod chriselrod enabled auto-merge (squash) May 28, 2021 16:54
@chriselrod
Collaborator Author

chriselrod commented May 28, 2021

I'll merge this. It does hurt compile times by a lot more than it helps benchmarks, but it does help benchmarks.
If someone wants good compile times with their matrix multiplication, they should just use LinearAlgebra.mul!.

We can get complex numbers to use this later.

@chriselrod chriselrod merged commit a856f5f into master May 28, 2021
@chriselrod chriselrod deleted the tilemajor branch May 28, 2021 19:39