
Comment #26

Open
ghost opened this issue Feb 27, 2019 · 9 comments

Comments

@ghost

ghost commented Feb 27, 2019

If you apply a predetermined random pattern of sign flipping to an input array, followed by the fast (Walsh) Hadamard transform, you get a random projection of the input data. You can use that for unbiased dimension reduction or increase. Repeat the process for better quality.
The outputs of the random projection strongly follow the Gaussian distribution because the central limit theorem applies not only to sums but also to sums and differences.
Anyway, if you binarize the outputs of the random projection you have a locality sensitive hash. If you interpret the output bits as +1,-1 you can multiply each bit by a weight and sum to get a recalled value. To train, recall and calculate the error. Divide by the number of bits. Then add or subtract that from each weight as appropriate to make the error zero. In that way you have created an associative memory.
Because the error has been distributed, non-similar input will result in the error fragments being multiplied by arbitrary +1,-1 hash bit values. Again you can invoke the central limit theorem to see the error fragments sum to zero-mean, low-level Gaussian noise.
The memory capacity is just short of the number of bits. If you use the system under capacity you get some repetition-code error correction.
Basically, when you store a new memory in that system all the previous memories get contaminated by a little Gaussian noise. However, for an under-capacity set of training data you can drive that noise to zero by repeated presentation.
This also provides a means to understand extreme learning machines and reservoir computing as associative memory.
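Here is a minimal numpy sketch of the scheme just described (the fwht helper, the training loop and all the names are my own, not anything from this library's API):

```python
import numpy as np

def fwht(x):
    # unnormalized fast Walsh-Hadamard transform, length must be a power of 2
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] = a + x[i + h:i + 2 * h]
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
        h *= 2
    return x

n = 256
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=n)             # predetermined random sign flips

def hash_bits(x):
    # one round of the random projection, binarized to +1/-1 LSH bits
    return np.where(fwht(signs * x) >= 0.0, 1.0, -1.0)

weights = np.zeros(n)
data = [(rng.standard_normal(n), t) for t in (1.0, -2.0, 0.5)]  # (key, value) pairs to store

for _ in range(20):                                 # repeated presentation drives the noise to zero
    for x, target in data:
        bits = hash_bits(x)
        error = (np.dot(bits, weights) - target) / n  # divide the error over the bits
        weights -= error * bits                       # recall for this key is now exactly the target

print([round(float(np.dot(hash_bits(x), weights)), 3) for x, _ in data])
```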

@dnbaker
Contributor

dnbaker commented Feb 27, 2019

I don't really understand why you've decided to add this comment, but you're right.

If you want to see applications, look at https://github.com/dnbaker/frp for orthogonal JL transforms and Gaussian Kernel Projections.

FALCONN-LIB uses it here for an LSH, as you've mentioned.

@ghost
Author

ghost commented Feb 27, 2019

This is the nearest thing to a WHT central station, so I'll discuss it a little. I'm interested in adding soft associative memory to neural networks that they can actually learn to use (ordinary RAM being very difficult for them to learn to use). You can also use locality sensitive hashing to switch different blocks of weight memory into the associative memory, to make truly vast memory systems that are still fast yet have soft characteristics. Or use LSH to switch between modules in deep neural networks to boost efficiency.
Also there are various ways of using the WHT and/or random projections for weight sharing in neural networks, or other weighting tricks.
One useful thing you can do to reduce crosstalk (poor input decision boundaries) with vector-to-vector associative memory is to learn a predetermined random intermediate target. Instead of learning a->b, learn a->r, r->b. In higher dimensions all the different r's are approximately orthogonal. Using that and some other tricks like soft binarization, it is definitely possible to exceed Hopfield networks on many metrics.

One simple thing you can try is to use a WHT or random projection after the final layer of a neural network. That can help the output neurons cooperate more efficiently.
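Going back to the block-switching idea above, here is a rough sketch (my own construction, not this library's API): a few bits of the WHT-based hash select which weight block the associative memory reads and writes, so similar inputs land on the same block while the total weight memory can be made very large.

```python
import numpy as np

def fwht(x):
    # unnormalized fast Walsh-Hadamard transform, length a power of 2
    h = 1
    while h < len(x):
        x = x.reshape(-1, 2, h)
        x = np.concatenate((x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]), axis=1).ravel()
        h *= 2
    return x

n, address_bits = 256, 4
rng = np.random.default_rng(1)
signs = rng.choice([-1.0, 1.0], size=n)
blocks = np.zeros((2 ** address_bits, n))            # 16 independent weight blocks

def hash_bits(x):
    return np.where(fwht(signs * x) >= 0.0, 1.0, -1.0)

def block_of(bits):
    # the first few hash bits address one block of weight memory
    return int(np.dot(bits[:address_bits] > 0, 2 ** np.arange(address_bits)))

def recall(x):
    bits = hash_bits(x)
    return np.dot(bits, blocks[block_of(bits)])

def train(x, target):
    bits = hash_bits(x)
    i = block_of(bits)
    blocks[i] -= (np.dot(bits, blocks[i]) - target) / n * bits
```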

@dnbaker
Contributor

dnbaker commented Feb 27, 2019

I recommend this paper for adaptive random spinners. The way to do this is to have a linear | FHT | linear | FHT | linear | FHT | <nonlinear transformation>. I imagine it could also be used for some kind of convolution, such as a substitute or extension of depthwise convolutions. (I'm reminded of Pay Less Attention, in particular.)

There isn't support for an FHT layer in PyTorch yet, but it could be quite useful. The gradient is simply the FHT again; however, there is the issue of normalization, which this library does not perform. (It's smart to avoid it, though; if you repeatedly apply it, you can normalize post hoc.)
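A rough sketch of what such a layer could look like (these are my own assumptions; no such layer exists in PyTorch or this library): the backward pass is the FHT again, and the normalization is applied post hoc.

```python
import torch

def fwht(x):
    # unnormalized fast Walsh-Hadamard transform over the last dimension (a power of 2)
    shape = x.shape
    n = shape[-1]
    x = x.reshape(-1, n)
    h = 1
    while h < n:
        x = x.reshape(-1, n // (2 * h), 2, h)
        x = torch.cat((x[:, :, 0, :] + x[:, :, 1, :],
                       x[:, :, 0, :] - x[:, :, 1, :]), dim=-1).reshape(-1, n)
        h *= 2
    return x.reshape(shape)

class FHT(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return fwht(x)

    @staticmethod
    def backward(ctx, grad_out):
        # H is symmetric and H @ H = n * I, so the gradient of y = H @ x is just H @ grad_out
        return fwht(grad_out)

# one "spinner"-style stack: (diagonal scaling | FHT) repeated, normalized post hoc
n = 256
x = torch.randn(8, n, requires_grad=True)
z = x
for d in [torch.randn(n, requires_grad=True) for _ in range(3)]:
    z = FHT.apply(z * d) / n ** 0.5
z.sum().backward()
```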

@ghost
Author

ghost commented Feb 27, 2019

Thanks for the random spinners link.
Anyway, one way to do weight sharing is to have a weight array and then take many random projections of it, each providing a different view of the underlying array. Then you use the projected values as weights. A disadvantage is that the weights all have around the same range of magnitude.
If you have, say, 5 vector-to-vector random projections of some input vector with n elements, then you can organize those into n 5-dimensional, unbiased, dimension-reduced views of the input. You can weight each of those 5 elements, sum them and put them through an activation function. Then you have a neuron with only 5 weights that is nevertheless fully connected, in some sense, with the entire n input elements.
And actually you can do better than that. Apply a non-linear function to each of the 5 vector elements, then weight each function output and sum. What is that? You have made a little extreme learning machine neuron. You could theoretically increase from 5-element subsamples to some number way beyond n, to get a more-than-fully-connected neuron in some sense. With the non-extreme-learning-machine neuron, linear algebra means there is no point in going beyond n-sized "subsamples."
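A rough numpy sketch of those two neuron types (my own construction and naming): k independent sign-flip + WHT projections give every output position a k-dimensional view of the whole input, which is then weighted and summed, with or without a nonlinearity applied to the view first.

```python
import numpy as np

def fwht(x):
    # unnormalized fast Walsh-Hadamard transform, length a power of 2
    h = 1
    while h < len(x):
        x = x.reshape(-1, 2, h)
        x = np.concatenate((x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]), axis=1).ravel()
        h *= 2
    return x

n, k = 256, 5
rng = np.random.default_rng(2)
signs = rng.choice([-1.0, 1.0], size=(k, n))        # one predetermined sign pattern per projection
weights = rng.standard_normal((n, k)) * 0.1         # only k weights per neuron

def views(x):
    # views[i, j] = element i of the j-th random projection of x
    return np.stack([fwht(s * x) / np.sqrt(n) for s in signs], axis=1)

x = rng.standard_normal(n)
v = views(x)                                        # shape (n, k)
plain = np.tanh(np.sum(weights * v, axis=1))        # k-weight neurons, fully connected "in some sense"
elm = np.sum(weights * np.tanh(v), axis=1)          # extreme learning machine style: nonlinearity first
```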
When I read the spinner paper, maybe it will say you can replace the random sign flipping with weighting each input before the WHT, which opens up interesting possibilities.
In terms of normalizing the output of your WHT code, the version of gcc I'm using (8.2.0) does a good job of autovectorizing the scaling code I wrote to use the SIMD instructions.

@ghost
Author

ghost commented Mar 1, 2019

Then there are a few less useful things. If you feed the output of a vector-to-vector random projection back to its input you get a kind of oscillator in which information reverberates around. You can probably understand that with digital signal processing math. You can use that in reservoir computing. If you connect several of those together, the stored information spreads out in a way similar to entropy in a physical system.
You can probably also speak of the entropy of a WHT-based random projection.
If you use random permutations as well as, or instead of, random sign flipping, that would seem to give much higher entropy than sign flipping on its own. However, random permutation is rather time consuming on CPUs because it really hammers the caches.
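A toy sketch of the feedback idea (my own construction): the state is repeatedly passed through a sign-flip + normalized WHT (an orthogonal mix, so nothing is lost), squashed slightly, and driven by an input stream, reservoir-computing style.

```python
import numpy as np

def fwht(x):
    # unnormalized fast Walsh-Hadamard transform, length a power of 2
    h = 1
    while h < len(x):
        x = x.reshape(-1, 2, h)
        x = np.concatenate((x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]), axis=1).ravel()
        h *= 2
    return x

n = 256
rng = np.random.default_rng(3)
signs = rng.choice([-1.0, 1.0], size=n)

def step(state, u, leak=0.9):
    # normalized WHT of a sign-flipped vector is orthogonal; the leak keeps the loop stable
    return np.tanh(leak * fwht(signs * state) / np.sqrt(n) + u)

state = np.zeros(n)
for u in 0.1 * rng.standard_normal((100, n)):       # drive the reservoir with an input stream
    state = step(state, u)
```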

The intermediate calculations in the (out-of-place) WHT algorithm are wavelet-like. If you normalize each layer of calculation, then you can pick out the highest energy/magnitude entry in the entire system, note its value and location, and then remove it by setting its value to zero. Then propagate the effect of that zeroing to every other layer. Then repeat a number of times.
Later you can decompress using just the data you have collected.

Anyway the random projections are related to random point picking on a hypersphere:

http://mathworld.wolfram.com/HyperspherePointPicking.html
This is because the normalized WHT and sign flipping leave the vector length unchanged, and any randomness gets converted toward the Gaussian distribution.
Also this early paper is not quite right because of the length invariance issue:
https://archive.org/details/bitsavers_mitreESDTe69266ANewMethodofGeneratingGaussianRando_2706065?q=normal+random+esd

This website is slightly messed up due to injected advertising; however, there is some information about using WHT random projections with smoothing for compressive sensing:
http://md2020.eu5.org/wht3.html
http://md2020.eu5.org/wht1.html

I don't know how useful that was. It is information largely in the form of hints that you can follow up or not depending on your disposition.

There is a book that has some information about the WHT:
https://www.jjj.de/fxt/

@ghost
Author

ghost commented May 21, 2019

@ghost
Author

ghost commented May 30, 2019

The central limit theorem applies not only to sums, it also applies to sums and differences (e.g. the WHT). Especially where there is any kind of randomness involved, the output of the WHT will tend toward the Gaussian distribution. That can be quite awkward sometimes. You can get the uniform distribution by normalizing the vector length of the output of the WHT and applying the Gaussian CDF to each element. The vector elements then range between 0 and 1. Alternatively, with no need to normalize, you can apply atan2(x, y) to 2 elements at a time and get a uniform distribution over the range of that function.
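A small sketch of those two mappings (the helper name and the scaling choices are mine):

```python
import numpy as np
from scipy.stats import norm

def fwht(x):
    # unnormalized fast Walsh-Hadamard transform, length a power of 2
    h = 1
    while h < len(x):
        x = x.reshape(-1, 2, h)
        x = np.concatenate((x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]), axis=1).ravel()
        h *= 2
    return x

n = 1024
rng = np.random.default_rng(4)
y = fwht(rng.choice([-1.0, 1.0], size=n) * rng.random(n))   # roughly Gaussian by the CLT

# route 1: scale each element to roughly unit variance, then the Gaussian CDF gives values in (0, 1)
u1 = norm.cdf(y / (np.linalg.norm(y) / np.sqrt(n)))

# route 2: no normalization needed; atan2 of pairs of roughly iid Gaussians is uniform on (-pi, pi]
u2 = np.arctan2(y[0::2], y[1::2])
```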

Sometimes the WHT is too slow at a time cost of O(n log n). You could use what I call the hstep function if you only need different mixtures of (windows onto) an input vector, say for associative memory.
Step through the elements of the input vector pairwise, put the sums of the pairs sequentially in the lower half of a new array and the differences sequentially in the upper half. If you repeat hstep k = log2(n) times you have the out-of-place WHT. If you keep going you of course get back to your original vector, but the WHT unwinds in a different way to the forward direction. So for a vector with 1024 elements you have k = 10, and you can have 20 different views of the original vector, including the original itself.
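A direct sketch of hstep as described above (the function name comes from this comment, the rest is my own):

```python
import numpy as np

def hstep(x):
    # one O(n) pass: pairwise sums go to the lower half, pairwise differences to the upper half
    pairs = x.reshape(-1, 2)
    out = np.empty_like(x)
    half = len(x) // 2
    out[:half] = pairs[:, 0] + pairs[:, 1]
    out[half:] = pairs[:, 0] - pairs[:, 1]
    return out

n = 1024
x = np.random.default_rng(5).standard_normal(n)

views = [x]
for _ in range(int(np.log2(n))):                    # k = 10 applications give the (out-of-place) WHT
    views.append(hstep(views[-1]))
```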

The problem with the out-of-place algorithm is that it hammers memory bandwidth: you read data, do +/-, and write data. Even the CPU caches don't have enough bandwidth for that; the algorithm is all data movement. You can use the in-place algorithm to address that issue.
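For contrast, a sketch of the in-place butterfly form, which updates each block where it sits rather than streaming the whole vector to a second array on every pass (a generic illustration, not this library's code):

```python
import numpy as np

def fwht_inplace(x):
    # unnormalized fast Walsh-Hadamard transform, updating x in place, length a power of 2
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] = a + x[i + h:i + 2 * h]
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
        h *= 2
    return x
```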
Anyway, it should obviously be possible to do the WHT in specialized hardware very quickly and with low power consumption. A chip containing only integer add, subtract and bit-shift circuits would be very power efficient, very small and use not so many transistors.
It would even be possible to have a completely analog system using high-speed op-amps.
I suggested that to Science Foundation Ireland 20 years ago and was told to ......

@ghost
Author

ghost commented Aug 1, 2019

Fixed Filter Bank Neural Networks.
Adjust the nonlinear functions, not the filters (i.e. not the weighted sums).
https://github.com/S6Regen/Fixed-Filter-Bank-Neural-Networks
https://discourse.numenta.org/t/fixed-filter-bank-neural-networks/6392
https://discourse.numenta.org/t/distributed-representations-1984-by-hinton/6378/10

Using the fast Walsh Hadamard transform as the fixed filter bank.
It seems to work well, within the limits of the very weak hardware I have available for testing.
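A hedged sketch of the general idea (not the code in the linked repo): the WHT is the fixed filter bank, and the only learned parameters are per-element two-slope functions applied between transforms (one plausible choice of adjustable nonlinearity).

```python
import numpy as np

def fwht(x):
    # unnormalized fast Walsh-Hadamard transform, length a power of 2
    h = 1
    while h < len(x):
        x = x.reshape(-1, 2, h)
        x = np.concatenate((x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]), axis=1).ravel()
        h *= 2
    return x

class FixedFilterBankNet:
    def __init__(self, n, depth, rng):
        # one positive-side and one negative-side slope per element per layer: these
        # are the only adjustable parameters, the filter bank itself is fixed
        self.pos = rng.standard_normal((depth, n))
        self.neg = rng.standard_normal((depth, n))
        self.n = n

    def forward(self, x):
        for p, q in zip(self.pos, self.neg):
            x = fwht(x) / np.sqrt(self.n)           # fixed filter bank: nothing to train here
            x = np.where(x > 0.0, p * x, q * x)     # adjustable per-element nonlinearity
        return x

net = FixedFilterBankNet(256, depth=4, rng=np.random.default_rng(6))
y = net.forward(np.random.default_rng(7).standard_normal(256))
```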

@ghost
Author

ghost commented Oct 14, 2019

ReLU is a literal switch. An electrical switch is n volts in, n volts out when on. Zero volts out when off.
The weighted sum (dot product) of a number of weighted sums is still a linear system.
For a particular input to a ReLU neural network all the switches are decidedly in either the on or off state. A particular linear projection is in effect between the input and output.
For a particular input and a particular output neuron there is a particular composition of weighted sums that may be condensed down into a single equivalent weighted sum.
You can inspect that to see what it is responding to in the input, or calculate metrics like the angle between the input and the weight vector of the equivalent weighted sum.
If the angle is near 90 degrees and the output of the neuron is large, then the vector length of the weight vector must be large. That makes the output very sensitive to noise in the inputs. If the angle is near zero, then there are averaging and central limit theorem effects that provide some error correction.
Since ReLU switches at zero, there are no sudden discontinuities in the output of a ReLU neural network for gradual changes in the input. It is a seamless system of switched linear projections.
There are efficient algorithms for calculating certain dot products like the FFT or WHT.
There is no reason you cannot incorporate those directly into ReLU neural networks since they are fully compatible.
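A small numpy sketch of the "condense to a single equivalent weighted sum" view (the toy network and names are mine): freeze the on/off state of every ReLU for one input, compose the linear maps, and measure the angle between the input and the equivalent weight vector of the output.

```python
import numpy as np

rng = np.random.default_rng(8)
W1 = rng.standard_normal((64, 32))                  # hidden layer weights
W2 = rng.standard_normal((1, 64))                   # output layer weights

x = rng.standard_normal(32)
h = W1 @ x
mask = (h > 0).astype(float)                        # which ReLU switches are on for this input
y = W2 @ (mask * h)                                 # ordinary forward pass

W_eq = W2 @ (mask[:, None] * W1)                    # the single equivalent weighted sum for this input
assert np.allclose(W_eq @ x, y)

w = W_eq[0]
cos = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
angle = np.degrees(np.arccos(cos))
# an angle near 90 degrees with a large output implies a large weight vector, hence noise sensitivity
print(float(y), float(angle))
```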
