
Merge upstream master #477

Merged
merged 415 commits into from
Sep 21, 2019
Conversation

rohithkrn

No description provided.

ShahriarSS and others added 30 commits September 11, 2019 16:39
Summary:
This PR adds Average Pool module to C++ front-end.
Pull Request resolved: pytorch#25800

Differential Revision: D17318094

Pulled By: yf225

fbshipit-source-id: c914c0e802bbe5f1d1f0a21a669c28bc956899db
Summary:
Just a tiny fix to make debugging easier (output errors to stderr and include them in the exception message)
Pull Request resolved: pytorch#25809

Reviewed By: zrphercule

Differential Revision: D17329957

Pulled By: houseroad

fbshipit-source-id: 0d73dd9f62c735fbc5096e6a7c0e5f58e4cd90ae
…ytorch#25862)

Summary:
This change adds new prepack and run functions for the FC and Convolution operators in QNNPACK.
The new functions are `PackBMatrix`, `qnnpackLinear`, `PrePackConvWeights`, and `qnnpackConv`.
Pull Request resolved: pytorch#25862

Test Plan:
QNNPACK unit tests
fully-connected-test
convolution-test

Differential Revision: D17299260

Pulled By: supriyar

fbshipit-source-id: fdc4e2d5f1232675acd153f3efb9d17ed8628a54
Summary:
Enable multi-GPU (mGPU) tests that pass on ROCm as of release 2.7.
Pull Request resolved: pytorch#26055

Differential Revision: D17331484

Pulled By: bddppq

fbshipit-source-id: 51f956a84a6c14a1a41473d322950994fa29c25c
Summary:
Pull Request resolved: pytorch#26075

att, remove the verbose argument to reduce noise in the logs

Test Plan:
ci

Imported from OSS

Differential Revision: D17335935

fbshipit-source-id: 2e4289e838bf4489dcad8d5533353eebcff0d481
Summary: Pull Request resolved: pytorch#25877

Test Plan: Imported from OSS

Reviewed By: jianyuh

Differential Revision: D17275746

Pulled By: jamesr66a

fbshipit-source-id: db2f38ddd99f02ccb4fb754fa1c1e6cad4425fa8
Summary:
Pull Request resolved: pytorch#26064

Just changing the names after pytorch#25678.
ghstack-source-id: 89944542

Test Plan: CI

Differential Revision: D17332068

fbshipit-source-id: 5e9febed7a2fcd10d44273e55643b277d33a3ad7
Summary:
Pull Request resolved: pytorch#25976

As recommended in https://github.com/pytorch/pytorch/pull/25877/files#r322956051:

> We should move more of these toward using BytesIO. Using files in tests is generally considered bad practice because it introduces syscalls and dependencies on the execution environment, and thus can cause test flakiness/instability.
ghstack-source-id: 89929947

Test Plan: CI

Differential Revision: D17310441

fbshipit-source-id: ba97cce4224225df45ff44062f1bc8ebefb25922
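
A minimal sketch of the in-memory pattern recommended above, using `io.BytesIO` in place of a temporary file (illustrative only, not the test code from the PR):

```python
import io
import torch

buf = io.BytesIO()
torch.save(torch.arange(4), buf)   # serialize into memory, no temp-file syscalls
buf.seek(0)                        # rewind before reading back
restored = torch.load(buf)
assert torch.equal(restored, torch.arange(4))
```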
…6079)

Summary:
Pull Request resolved: pytorch#26079

This reverts commit e303961.

Test Plan: Imported from OSS

Differential Revision: D17337585

Pulled By: jamesr66a

fbshipit-source-id: 4b93a4c5ca2fe491d609da889a42d22be8e52889
Summary:
Pull Request resolved: pytorch#25680

Add a runtime flag to choose between FBGEMM and QNNPACK when compiled with both.

The flag can be set by using torch.backends.quantized.engine = torch.fbgemm/torch.qnnpack or ctx::setPreferredQuantizedEngine(at::QEngine)
ghstack-source-id: 89935643

Test Plan: Verified torch.backends.quantized.engine works

Differential Revision: D17198233

fbshipit-source-id: e5449d06f4136385e0e6d18bd4237f8654a61672
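
A short sketch of the runtime flag described above. The commit text shows `torch.fbgemm`/`torch.qnnpack`; later releases accept the backend name as a string, so the exact spelling depends on the build:

```python
import torch

# Assumed string form of the backend names; the commit also mentions
# torch.fbgemm / torch.qnnpack as accepted values.
torch.backends.quantized.engine = 'qnnpack'
print(torch.backends.quantized.engine)  # which kernels quantized ops will dispatch to
```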
Summary:
Pull Request resolved: pytorch#25734

[pytorch] Dynamic registration of RPC backends.
Allow non-process-group RPC backends to be plugged in as a backend.
ghstack-source-id: 89938296

Differential Revision: D17183789

fbshipit-source-id: 885fed12d80b82b60f9a125f78302a161e708089
Summary:
Enable one unit test that passes now.
Pull Request resolved: pytorch#25956

Differential Revision: D17298150

Pulled By: bddppq

fbshipit-source-id: 8763e71ad7ef80be915fe93a3471b29f27f3f0a4
)

Summary: Pull Request resolved: pytorch#26030

Test Plan:
- [namedtensor ci]

Pull Request resolved: pytorch#26030

Differential Revision: D17322383

Pulled By: zou3519

fbshipit-source-id: d5b914d646b48a6f4e0104aceb435e694b72bd96
Summary:
Pull Request resolved: pytorch#26050

Throws a warning once when someone attempts to attach names to a tensor.
This is guaranteed to happen at the callsite `set_named_tensor_meta`.

Test Plan: - run tests [namedtensor ci]

Differential Revision: D17331634

Pulled By: zou3519

fbshipit-source-id: 44f5e5c95acd9c7ba543c1210a3b1314aab348f0
Summary:
While this isn't ideal, as it might print out the same source every time a function is run, it's still easier to tweak Python code to reduce loop counts than to insert `std::cout` and recompile C++ code.
Pull Request resolved: pytorch#25868

Differential Revision: D17318386

Pulled By: Krovatkin

fbshipit-source-id: 928ba6543204042924ab41a724635594709630de
Summary:
This test was recently enabled in pytorch#26055; it's flaky on master:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/37575
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/37577
```
05:39:35 test_stream_event_nogil (__main__.TestCuda) ... Exception in thread Thread-3:
05:39:40 Traceback (most recent call last):
05:39:40   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
05:39:40     self.run()
05:39:40   File "/usr/lib/python2.7/threading.py", line 754, in run
05:39:40     self.__target(*self.__args, **self.__kwargs)
05:39:40   File "test_cuda.py", line 1894, in _test_stream_event_nogil
05:39:40     c2p.put(sync_func(self, TestCuda.FIFTY_MIL_CYCLES))
05:39:40   File "test_cuda.py", line 1882, in _event_wait
05:39:40     self.assertTrue(s1.query())
05:39:40   File "/usr/lib/python2.7/unittest/case.py", line 422, in assertTrue
05:39:40     raise self.failureException(msg)
05:39:40 AssertionError: False is not true
```
Pull Request resolved: pytorch#26087

Differential Revision: D17340891

Pulled By: bddppq

fbshipit-source-id: b2b70beb1b068db53197a5f9f6a80cb046e66ebd
Summary: Pull Request resolved: pytorch#26084

Test Plan: Imported from OSS

Differential Revision: D17339315

Pulled By: jamesr66a

fbshipit-source-id: 03a2674edcf779becfe3b8ec96f1bae23c74b11c
…db3244 (pytorch#25959)

Summary:
Pull Request resolved: pytorch#25959

Previous import was 28ca699b69b5a31892619defca2391044a9a6052

Included changes:
- **[7988d836](onnx/onnx@7988d836)**: Supporting negative axes for all existing onnx ops (pytorch#2281) <Negin Raoof>
- **[5ca0a09e](onnx/onnx@5ca0a09e)**: Update managingexperimentalops.md (pytorch#1981) <Joseph Spisak>
- **[bc0495c1](onnx/onnx@bc0495c1)**: Fix link to community docs in readme (pytorch#2261) <Prasanth Pulavarthi>
- **[2fdb3ef6](onnx/onnx@2fdb3ef6)**: move map and sequence types to onnx domain, (pytorch#2244) <Ke Zhang>
- **[568b65aa](onnx/onnx@568b65aa)**: Improve compatiblity with proto3 and enable reading attributes (pytorch#2288) <Dmitri Smirnov>
- **[1f350f2c](onnx/onnx@1f350f2c)**: Remove type info for loop variadic input in Loop op used to compose the Range op (pytorch#2287) <Hariharan Seshadri>
- **[eb139446](onnx/onnx@eb139446)**: Add Foundation WG to working-groups.md (pytorch#2276) <Ryan Loney>
- **[4eabc4b3](onnx/onnx@4eabc4b3)**: Fix testdata model for CumSum. Add exclusive attribute. (pytorch#2271) <jignparm>
- **[1a62afdb](onnx/onnx@1a62afdb)**: Support GatherND operator in ONNX (pytorch#2106) <Hariharan Seshadri>
- **[0e330e9d](onnx/onnx@0e330e9d)**: Support ScatterND operator in ONNX (pytorch#2220) <Bowen Bao>
- **[733f7a6a](onnx/onnx@733f7a6a)**: Add Det to ONNX (pytorch#2233) <Bowen Bao>
- **[52187738](onnx/onnx@52187738)**: Update the description of nearest_mode of resize op (pytorch#2257) <daquexian>
- **[64b4b686](onnx/onnx@64b4b686)**: Adding sparse tensor to ONNX (pytorch#2019) <G. Ramalingam>
- **[c8a8b7cc](onnx/onnx@c8a8b7cc)**: Support Range operator in ONNX (pytorch#2242) <Hariharan Seshadri>
- **[44b0d6d5](onnx/onnx@44b0d6d5)**: Update resize op (pytorch#2057) <daquexian>
- **[7d907964](onnx/onnx@7d907964)**: Add function to fuse dynamic quantization graph into 1 node (pytorch#2187) <Ashwini Khade>
- **[36f8e6d9](onnx/onnx@36f8e6d9)**: Update logo_request.md (pytorch#2231) <Prasanth Pulavarthi>
- **[4eb737c8](onnx/onnx@4eb737c8)**: Update Clip in opset 11 to support min/max as inputs instead of attributes (pytorch#2096) <Bowen Bao>
- **[a25e1388](onnx/onnx@a25e1388)**: Fix segfault in tile shape inference (pytorch#2221) <daquexian>
- **[2dc273c7](onnx/onnx@2dc273c7)**: update onehot shape inference to reflect the spec for depth input (pytorch#2224) <Ashwini Khade>
- **[665211c1](onnx/onnx@665211c1)**: Add GatherElements Op and Rename ScatterElements (pytorch#2143) <Lara Haidar>
- **[3ba2e31a](onnx/onnx@3ba2e31a)**: Unique (pytorch#2141) <liqunfu>
- **[5a5588ad](onnx/onnx@5a5588ad)**: Clarify dimension variable scoping (pytorch#2211) <G. Ramalingam>
- **[fabe39d5](onnx/onnx@fabe39d5)**: Liqun/topk sort (pytorch#2126) <liqunfu>
- **[453aa644](onnx/onnx@453aa644)**: Update document for NMS (pytorch#2193) <Hector Li>
- **[34e28ec2](onnx/onnx@34e28ec2)**: Handle negative 'axis' value in Split type and shape inferencing (pytorch#2177) <Scott McKay>
- **[28ec4583](onnx/onnx@28ec4583)**: depth to space shuffle order (pytorch#2163) <Negin Raoof>
- **[98f72629](onnx/onnx@98f72629)**: minor updates to fix links in readme (pytorch#2189) <Prasanth Pulavarthi>
- **[321d1467](onnx/onnx@321d1467)**: Add check to disallow squeezing input axes which are not 1 (pytorch#2204) <Ashwini Khade>
- **[573f0dc9](onnx/onnx@573f0dc9)**: fix a bug in fun shape inference (pytorch#2188) <Tang, Cheng>
- **[36dc7110](onnx/onnx@36dc7110)**: Clarify ambiguity in gather spec regarding indices expectation (pytorch#2202) <Ashwini Khade>
- **[a2449673](onnx/onnx@a2449673)**: Fix some minor issues in IR.md and Versioning.md (pytorch#2108) <edgchen1>
- **[349aff69](onnx/onnx@349aff69)**: Skip install typing package for python >=3.5 (pytorch#2199) <bddppq>

Test Plan: ci

Reviewed By: bddppq, benoitsteiner

Differential Revision: D17296390

fbshipit-source-id: 9f9f5ce85d9694128008d756c2ea393bd4e0cb71
Summary:
cc: gchanan zou3519

I will look into why this is failing spuriously.
Pull Request resolved: pytorch#26108

Differential Revision: D17348399

Pulled By: zou3519

fbshipit-source-id: aed4ccfc3f106692d4e32acc029740309570b0c3
Summary:
Pull Request resolved: pytorch#26080

Will be used in the conversion of the Caffe2 ctr_mbl_feed model to PyTorch

Test Plan: Unit test

Reviewed By: yinghai

Differential Revision: D17337604

fbshipit-source-id: a90d9f5dc38301608d1562c6f2418e7f4616e753
Summary: Pull Request resolved: pytorch#25863

Differential Revision: D17347386

Pulled By: Krovatkin

fbshipit-source-id: a42cf56680a27bc3e50fd945ab372a409225b875
Summary:
This basically works as a simple filter, as you suggested, ZolotukhinM.

`export PYTORCH_JIT_LOG_LEVEL=guard_elimination` will print all `GRAPH_DUMP` and `GRAPH_UPDATE` statements.
`export PYTORCH_JIT_LOG_LEVEL=>guard_elimination:>alias_analysis` will print all `GRAPH_DUMP`, `GRAPH_UPDATE` **and** `GRAPH_DEBUG` statements in `guard_elimination.cpp` **and** in `alias_analysis.cpp`
Pull Request resolved: pytorch#25895

Differential Revision: D17309090

Pulled By: Krovatkin

fbshipit-source-id: 8fa9e67cc9af566b084d66cc15223633fda08444
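
A hedged example of using the filter described above from Python; the variable is read by the JIT logging machinery, so it must be set before the passes of interest run:

```python
import os

# Enable GRAPH_DUMP/GRAPH_UPDATE/GRAPH_DEBUG output for two passes,
# as described in the commit message above.
os.environ["PYTORCH_JIT_LOG_LEVEL"] = ">guard_elimination:>alias_analysis"

import torch  # import (or re-run the compilation) after setting the variable
```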
Summary:
Pull Request resolved: pytorch#25606

This just complicates the codegen for no benefit.

Test Plan: Imported from OSS

Differential Revision: D17172498

Pulled By: gchanan

fbshipit-source-id: d2f50e45400ac0336792422518e03dbae3a1bedc
Summary:
Pull Request resolved: pytorch#25607

Since we don't generate these as end-user bindings, and we no longer reorder based on this property, we can just get rid of the property.

Test Plan: Imported from OSS

Differential Revision: D17172500

Pulled By: gchanan

fbshipit-source-id: f84fd8bb2b13598501897f56871b21339585d844
Summary:
Pull Request resolved: pytorch#25897

It doesn't hurt to set all variables unconditionally.
Also, we can create a link to the lib directory instead of to specific files; this
way it's easier to switch between dynamic and static library names.

Test Plan:
- check android gradle CI;
- use stack diff to check all 4 architectures on PR;

Pull Request resolved: pytorch#25897

Differential Revision: D17307240

Pulled By: ljk53

fbshipit-source-id: c975085ddda852ef7da1c29935c2f6a28d797e5a
Summary:
Pull Request resolved: pytorch#25984

Link static libtorch libraries into pytorch.so (API library for android)
with "-Wl,--gc-sections" flag to remove unused symbols in libtorch.

Test Plan:
- full gradle CI with stacked PR;
- will check final artifacts.tgz size change;

Differential Revision: D17312859

Pulled By: ljk53

fbshipit-source-id: 99584d15922867a7b3c3d661ba238a6f99f43db5
Summary:
Pull Request resolved: pytorch#26113

After pytorch#16914, passing an
argument such as "build_deps" (i.e. `python setup.py build_deps develop`)
fails because it is treated as an unrecognized argument.
ghstack-source-id: 90003508

Test Plan:
Before, this script would execute "python setup.py build_deps
develop", which errored. Now it executes "python setup.py develop" without an
error. Verified by successfully running the script on devgpu. In setup.py,
there is already a `RUN_BUILD_DEPS = True` flag.

Differential Revision: D17350359

fbshipit-source-id: 91278c3e9d9f7c7ed8dea62380f18ba5887ab081
Summary: Pull Request resolved: pytorch#25608

Test Plan: Imported from OSS

Differential Revision: D17172494

Pulled By: gchanan

fbshipit-source-id: 5a46889cc040297231e2473ae5b2879b39f8d60a
Summary:
The base_lr parameter was being overridden by the parent class's `__init__`; see pytorch#21965.
Pull Request resolved: pytorch#26105

Reviewed By: yf225

Differential Revision: D17346724

Pulled By: vincentqb

fbshipit-source-id: 4b146bd64f4f385c0a9c4f4df8eb8991312fb15c
Summary:
Pull Request resolved: pytorch#25504

Skip inserting duplicate observers for values already observed
in the forward method of a child module or in other methods of
the current module.

Test Plan:
python test/test_jit.py -- 'TestJit.insert_observers'
python test/test_jit.py -- 'TestJit.insert_observers_child_qconfig'
python test/test_jit.py -- 'TestJit.insert_observers_skip_values'

Imported from OSS

Differential Revision: D17208888

fbshipit-source-id: e04f1c22ab1c4f410933a17a3ef31acf5f217323
Elias Ellison and others added 25 commits September 19, 2019 15:46
Summary:
In schema matching we allow a homogeneous tuple to be matched to list arguments. This logic wasn't yet extended to vartype lists, causing calls like `len((1, 2, 3))` to fail.

Fix for pytorch#20500
Pull Request resolved: pytorch#25944

Differential Revision: D17482510

Pulled By: eellison

fbshipit-source-id: aa63318c27a01d965a7a7b68ce8bec638168dc26
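
A minimal repro sketch of the case fixed above (hypothetical test, not taken from the PR):

```python
import torch

@torch.jit.script
def tuple_len() -> int:
    # A homogeneous tuple matched against the list argument of len();
    # this failed to compile before the schema-matching fix.
    return len((1, 2, 3))

assert tuple_len() == 3
```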
Summary:
At the moment it includes the pytorch#26219 changes. That PR is landing right now; afterwards this PR will contain only the Javadocs.

Applied all of dreiss's comments from the previous version.
Pull Request resolved: pytorch#26149

Differential Revision: D17490720

Pulled By: IvanKobzarev

fbshipit-source-id: f340dee660d5ffe40c96b43af9312c09f85a000b
Summary:
This PR adds support for multidimensional inputs to `torch::tensor`, to match the Python `torch.tensor` API.

Closes pytorch#16099.
Pull Request resolved: pytorch#26210

Differential Revision: D17456761

Pulled By: yf225

fbshipit-source-id: a53ce74c535c13c5dcb833f19e9b6b79d12376b5
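
For reference, the Python behavior the C++ `torch::tensor` call is being matched against (roughly `torch::tensor({{1., 2.}, {3., 4.}})` on the C++ side; shown here in Python since that defines the target semantics):

```python
import torch

# Nested lists produce a multidimensional tensor; the PR brings the
# C++ torch::tensor API to parity with this.
t = torch.tensor([[1., 2.], [3., 4.]])
print(t.shape)   # torch.Size([2, 2])
print(t.dtype)   # torch.float32
```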
…torch#26471)

Summary:
Pull Request resolved: pytorch#26471

att

Test Plan:
.

Imported from OSS

Differential Revision: D17491215

fbshipit-source-id: 5790aa0113bfdbeeb838f3d1530397606ccaa1e9
Summary:
Serialization.cpp fails on big-endian machines.
This patch fixes the endianness bugs and also makes PyTorch
model files portable across architectures with different endianness:
an x86-generated model file can be read on the s390 architecture.

The first problem is that serialization.cpp forgets to convert the "size" value
of the storage elements to the native byte order.
torch.load throws an assertion as a result
(see the first stack trace below).

The second problem is that when it reads the model from storage (doRead),
it decodes values to little-endian, which is the wrong order
on a big-endian machine. The decode should use
THP_nativeByteOrder() instead
(see the model dump below).
```
loaded_model = torch.load(opt.model_file, map_location=torch.device("cpu"))
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 422, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 616, in _load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 2305843009213693952 got 32
	(the very long number is actually 32 in the wrong endianness)
```

Model file load on x86 (correct output)
```
>>> import torch
>>> torch.load('400f2k_best.model', map_location=torch.device("cpu"))
{'epoch': 24, 'model_type': 'emb_aec', 'classifier_model': OrderedDict([('model.0.weight', tensor([[ 2.4608e-01, -1.1174e-01, -1.0854e-01,  4.0124e-01, -1.5261e-02,
         -1.2206e-01,  1.3229e-01, -1.2615e-01, -5.2773e-01,  2.6333e-01,
         -3.1462e-03, -1.4902e-01,  9.8545e-02, -1.5789e-01, -2.2625e-01,
         -1.0776e-01, -9.0895e-02, -3.8530e-01,  9.1152e-01, -3.9720e-01,
         -8.5848e-01, -4.7837e-02, -1.5178e-01,  8.5023e-02,  1.5013e-01,
         -9.9294e-02, -2.7422e-01, -4.3986e-01, -4.4297e-01, -3.9570e-01,
```

Model file load on s390x (wrong endianness; notice the exponents)
```
>>> import torch
>>> torch.load( "400f2k_best.model", map_location=torch.device("cpu"))
{'epoch': 24, 'model_type': 'emb_aec', 'classifier_model': OrderedDict([('model.0.weight', tensor([[ 9.2780e+21, -9.7722e-11,  4.1350e+33,  7.782e+34,  4.2056e-31,
          9.0784e+18,  1.1846e-32,  3.3320e-32, -4.8288e-28, -7.2679e+12,
          1.5379e-16, -5.2604e+12, -4.7240e+17,  4.6092e-21, -1.8360e-20,
         -2.7712e-31,  1.4548e-16, -2.5089e-27,  7.9094e-10,  7.1977e+34,
          1.1930e+26,  8.4536e+15,  2.7757e+23, -5.8455e-10, -1.5611e+09,
         -1.1311e-23,  6.6451e+19, -2.0970e+20,  3.4878e-19, -1.0857e-12,
          7.8098e+22,  5.3998e-35],
```
Pull Request resolved: pytorch#26383

Differential Revision: D17480891

fbshipit-source-id: f40569c7b9c4a1935dceb41f1a2508ce21ea3491
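
A small illustration (not the torch code path) of the byte-order problem described above, assuming NumPy for the byte manipulation; note that the size value 32 read with the wrong endianness gives exactly the 2305843009213693952 from the stack trace:

```python
import numpy as np

le_bytes = np.array(32, dtype='<i8').tobytes()        # 64-bit size field as written on x86
wrong = int(np.frombuffer(le_bytes, dtype='>i8')[0])  # read big-endian: 2305843009213693952
right = int(np.frombuffer(le_bytes, dtype='<i8')[0])  # decode to native order: 32
print(wrong, right)
```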
Summary:
Pull Request resolved: pytorch#26477

- At inference time we need to turn off autograd mode and turn on no-variable
  mode, since we strip out these modules for the inference-only mobile build.
- Both flags are stored in thread-local variables, so we cannot simply
  set them to false globally.
- Add the "autograd/grad_mode.h" header to the all-in-one header 'torch/script.h'
  to reduce friction for iOS engineers who might need to do this manually in their
  project.

P.S. I tried to hide AutoNonVariableTypeMode in codegen but figured it's not
very trivial (e.g. there are manually written parts not covered by codegen).
Might try it again later.

Test Plan: - Integrate with Android demo app to confirm inference runs correctly.

Differential Revision: D17484259

Pulled By: ljk53

fbshipit-source-id: 06887c8b527124aa0cc1530e8e14bb2361acef31
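
The commit concerns the C++ mobile path (the guards now ship via `torch/script.h`); the Python analog of the two flags being flipped is sketched below, with a hypothetical model path and input shape:

```python
import torch

module = torch.jit.load("model.pt")   # hypothetical TorchScript model
module.eval()
with torch.no_grad():                  # Python counterpart of the C++ no-grad guard
    out = module(torch.rand(1, 3, 224, 224))
```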
Summary:
Pull Request resolved: pytorch#25975

We would like to add FP16 weight support for the dynamic quantized LSTM.

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_quantized_rnn \(test_quantization\.PostTrainingDynamicQuantTest\)'  --print-passing-details

```
[jianyuhuang@devvm794.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ buck test mode/dev caffe2/test:quantization
-- 'test_quantized_rnn \(test_quantization\.PostTrainingDynamicQuantTest\)'  --print-passing-details
Building: finished in 13.4 sec (100%) 8134/8134 jobs, 81 updated
  Total time: 13.9 sec
Trace available for this run at /tmp/testpilot.20190910-210241.2092790.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision c86e65add357582accb6ec0be23b92c8a2c510bd fbpkg ca46e8f5b26c451a8b0b2462c11bb61d at Mon Sep  9
22:16:37 2019 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/696/t.par
Discovering tests
Running 1 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/1125900050322971
      ✓ caffe2/test:quantization - test_quantized_rnn (test_quantization.PostTrainingDynamicQuantTest) 0.183 1/1 (passed)
Test output:
> test_quantized_rnn (test_quantization.PostTrainingDynamicQuantTest) ... ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.184s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/1125900050322971
Summary (total time 4.35s):
  PASS: 1
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Differential Revision: D17299116

fbshipit-source-id: 7fe91ece25867f2c0496f1b63fb1041e6b815166
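
A hedged sketch of the feature being exercised, assuming the public dynamic-quantization entry point with an FP16 dtype (the exact API surface at the time of this commit may differ):

```python
import torch

lstm = torch.nn.LSTM(input_size=16, hidden_size=32)
# Dynamically quantize the LSTM with FP16 weights (the dtype this commit adds support for).
qlstm = torch.quantization.quantize_dynamic(lstm, {torch.nn.LSTM}, dtype=torch.float16)
x = torch.randn(5, 1, 16)
out, _ = qlstm(x)
```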
Summary:
Adds `Distance` module parity.
pytorch#25883
Pull Request resolved: pytorch#26424

Differential Revision: D17487314

Pulled By: yf225

fbshipit-source-id: c7d124cb4afb08a4733e7212af0bb276bf32d172
…ytorch#26498)

Summary:
Pull Request resolved: pytorch#26498

We should allocate an empty tensor as a result tensor when performing
binary ops. Currently some ops use `empty_like(self)` as the initial
result tensor before passing it into TensorIterator. This is not very
efficient because TensorIterator may resize the tensor due to
broadcasting, causing more memory allocation. By using an empty tensor
as the result tensor, we only need to allocate/resize memory once as
opposed to twice.

Also fixes pytorch#26495. The bug
there is that the implementation of `pow` is missing a resize in one
case.

Test Plan:
- new test
- run tests

Differential Revision: D17500025

Pulled By: zou3519

fbshipit-source-id: bff4949af5e75541c04669b961bcf2e1ec456faf
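
An illustrative sketch of the broadcasting case discussed above (assumed shapes; not the PR's test):

```python
import torch

a = torch.rand(4, 1)
b = torch.rand(1, 5)

# The broadcast result is 4x5, so pre-allocating a result shaped like `a`
# (empty_like) only forces TensorIterator to resize it; starting from an
# empty tensor allocates the broadcast shape once.
out = torch.empty(0)
torch.pow(a, b, out=out)
assert out.shape == (4, 5)
```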
Summary:
Pull Request resolved: pytorch#26504

[pytorch] [distributed] Make the destructor virtual for a class with virtual functions.
Not having a virtual destructor may lead to a memory leak.
ghstack-source-id: 90454880

Test Plan: Made sure pg based UT works.

Differential Revision: D17488876

fbshipit-source-id: 5fdc55e175fd2b22e931b740c36cb1feed454066
Summary:
test_wrapped_number was calling torch.set_default_tensor_type('torch.FloatTensor'), which was setting the default tensor types for all following tests until a class boundary (with unittest) or until end of file (with pytest). Tests that don't expect the default tensor type to be set this way were then failing if run afterwards.

This fixes the issue by copying the default_tensor_type decorator from test_nn and using that instead with test_wrapped_number. The decorator correctly resets the default tensor type after the test has run.

This fixes the many errors encountered when running pytest test_jit.py.

Note: test_wrapped_number was introduced in pytorch#22273.
Pull Request resolved: pytorch#26523

Differential Revision: D17495283

Pulled By: mruberry

fbshipit-source-id: ab518c78b7706af7cb1c2d1c17823d311178996d
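
A rough sketch of the decorator pattern described above (the actual helper copied from test_nn may differ in detail):

```python
import functools
import torch

def default_tensor_type(type_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            saved = torch.tensor([]).type()            # remember the current default type
            torch.set_default_tensor_type(type_name)
            try:
                return fn(*args, **kwargs)
            finally:
                torch.set_default_tensor_type(saved)   # restore it even if the test fails
        return wrapper
    return decorator
```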
Summary:
These are intentionally not yet used by the encoder to avoid backcompat issues.
Pull Request resolved: pytorch#26454

Differential Revision: D17480844

fbshipit-source-id: e88ae7f5b94e32c7f12341a750aa4b9f7374bfb7
Summary:
Pull Request resolved: pytorch#26501

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at pytorch/xla#1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D17499154

Pulled By: ezyang

fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
Summary:
Pull Request resolved: pytorch#26364

Per pytorch#25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested by modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```
ghstack-source-id: 90480086

Test Plan: See above.

Differential Revision: D17430164

fbshipit-source-id: 1482aca72fcc3ddb95ea25649ec057edda5d1934
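
A hedged sketch using the Python binding (the fix itself is in the C++ `tcputil::connect`); the constructor signature is assumed from current releases:

```python
from datetime import timedelta
import torch.distributed as dist

# Client-side TCPStore with a 3-second timeout; after this commit a connect
# loop of ECONNREFUSED/ECONNRESET raises instead of spinning forever.
try:
    store = dist.TCPStore("127.0.0.1", 29500, 1, False, timeout=timedelta(seconds=3))
except RuntimeError as e:
    print(e)   # e.g. "Connecting to TCP store timed out." when nothing is listening
```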
Summary:
Pull Request resolved: pytorch#26515

Fix patterns of `prepack` and `permute` after recent changes
to `quantized::conv2d` and `quantized::conv2d_prepack`

Test Plan:
python test/test_jit.py 'TestJit.test_quant_fusion'

Imported from OSS

Differential Revision: D17502573

fbshipit-source-id: 1a719fd610e8ea9dc16075abaa042556e1edbceb
Summary:
If the `Union` contains a non-class type, `issubclass` would fail; this
adds a check for that case.
Pull Request resolved: pytorch#26312

Pulled By: driazati

Differential Revision: D17486465

fbshipit-source-id: c513cef3bbc038f15c021eb0c1bf36be0df1eb90
Summary:
When used as annotations on Python functions, `NamedTuple`s go through our Python annotation -> type mapping, which previously had no way of looking up `NamedTuple`s (they are created lazily by checking if the type has certain properties, so the lookup is creating the `TupleType` from scratch). This PR threads through the necessary data to make them work.

Fixes pytorch#26437
Pull Request resolved: pytorch#26443

Pulled By: driazati

Differential Revision: D17486441

fbshipit-source-id: a6bbb543ff05a5abe61f1a7f68db9ecdb652b358
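
A minimal repro sketch of the kind of annotation fixed here (hypothetical names):

```python
from typing import NamedTuple
import torch

class Point(NamedTuple):
    x: torch.Tensor
    y: torch.Tensor

@torch.jit.script
def shift(p: Point) -> Point:
    # The Python NamedTuple annotation now maps to a TorchScript TupleType.
    return Point(p.x + 1, p.y + 1)

print(shift(Point(torch.zeros(2), torch.ones(2))))
```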
Summary:
With this PR, we establish the following conventions:
1. Options in C++ module / optimizer constructors should always be `const SomeOptions&` type, not `SomeOptions` type.
2. The options constructor arg should always be named `options_`, not `options`, to not be confused with the module / optimizer's internal field `options`.
3. We never use `std::move` to assign `options_` to the module / optimizer's internal field `options` in the constructor definition. Instead, we simply use `options(options_)`.

Here is the reasoning:
We might be tempted to declare the constructor as `SomeModule(SomeOptions options_)` and have `options(std::move(options_))` in the member initialization list. However, this can be a dangerous design because the constructor might use `options_` to set values for other member fields in the member initialization list (e.g. https://github.com/pytorch/pytorch/blob/8317f75b79fb78ceeeb928aa23a901d57274b9e1/torch/csrc/api/include/torch/optim/lbfgs.h#L30-L34), and use-after-move can cause hard-to-debug problems.
Instead, we choose to explicitly use `const SomeOptions&` type for `options_`, and never use `std::move` to assign it to the internal `options` field. This way we have stronger guarantee on the validity of `options_` at any point in the constructor.

Notable exceptions to the above conventions:
1. C++ Embedding module doesn't adhere to the conventions now, which will be fixed after pytorch#26358 is landed.
2. C++ dataloader and dataset classes likely need similar changes. We will do it when we start to work on dataloader/dataset parity.

Thanks ShahriarSS for discovering the options usage inconsistency! 🚀
Pull Request resolved: pytorch#26483

Differential Revision: D17500451

Pulled By: yf225

fbshipit-source-id: 49361a3519e4ede933789db75731d40144f0b617
Summary:
Pull Request resolved: pytorch#26479

This PR doesn't delete the code for them yet because it takes some effort to
determine what to delete. I will send a followup PR fully deleting
tagged names, but this PR disables their creation.

Test Plan: - [namedtensor ci]

Differential Revision: D17484758

Pulled By: zou3519

fbshipit-source-id: 451409e36eac98ffee1b98884d0f675bb5d46c9d
Summary: Pull Request resolved: pytorch#26365

Test Plan: - [namedtensor ci]

Differential Revision: D17484759

Pulled By: zou3519

fbshipit-source-id: 44068c1e9d84adf36c5ab5e7006a153b948914d6
Summary:
Pull Request resolved: pytorch#26366

Changes:
- `NameType::NORMAL` -> `NameType::BASIC`
- `Dimname::is_wildcard` -> `Dimname::isWildcard()`
- `Dimname::is_normal` -> `Dimname::isBasic()`.
- `at::is_valid_identifier` -> `Dimname::isValidName(string)`
- `at::match`, `at::unify` are now methods on `Dimname`.

I am adopting CamelCase for struct members of a named tensor related
struct.

Test Plan: - [namedtensor ci]

Differential Revision: D17484757

Pulled By: zou3519

fbshipit-source-id: 21c128e5025e81513e14d34506a7d7744caefdc2
Summary: Pull Request resolved: pytorch#26217

Test Plan: Imported from OSS

Differential Revision: D17427577

Pulled By: pbelevich

fbshipit-source-id: e9b3e76ca44df883e3038b688dd7b930752d93a2
Test Plan: revert-hammer

Differential Revision:
D17486465

Original commit changeset: c513cef3bbc0

fbshipit-source-id: 567311c001d7dd0b7ab9ffe8bb894954bea583c9
Test Plan: revert-hammer

Differential Revision:
D17427577

Original commit changeset: e9b3e76ca44d

fbshipit-source-id: a5bbae208ba33a31f90ab5c9b199f232de0c6d1b
@iotamudelta iotamudelta left a comment

LG, all the required tests pass.

@rohithkrn rohithkrn merged commit 02476d2 into ROCm:bf16_bringup Sep 21, 2019
rohithkrn added a commit that referenced this pull request Sep 30, 2019
rohithkrn added a commit that referenced this pull request Oct 1, 2019
@rohithkrn rohithkrn deleted the rn/up-master branch October 3, 2019 00:27