# Randomized Matrix Multiplication
Approximating a large sparse matrix through random sampling.
Background:
Initially, this began as a research project assigned by my Linear Algebra professor, and it was just that. However, after researching the topic, my interest in the overall subject grew as I began to recognize familiar concepts from previous work.
Intuition:
Large matrices are computationally expensive, so when we have a large sparse matrix, like a dataset of word occurrences, we spend a lot of compute on just zeros. To reduce that cost, methods have been developed to approximate these large matrices.
Algorithm (From my understanding):
The algorithm I used to approximate the matrix offers two options: uniform sampling or non-uniform (norm-based) sampling. The overall goal is to build an unbiased estimate (expected value) of the matrix from a handful of column samples, yielding a smaller, denser matrix. The pitfall of uniform sampling is that it treats every column as equally important, even when most columns are all zeros. Norm-based sampling instead weighs a column's importance by how much "energy" it contributes to the total matrix; it is a proportional metric.
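The two sampling options can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the standard randomized matrix multiplication setup, where L columns of a matrix A are drawn with probability proportional to their squared norm (or uniformly) and rescaled so that the sampled product is an unbiased estimate of A Aᵀ.

```python
import numpy as np

def sample_columns(A, L, norm_based=True, rng=None):
    """Draw L columns of A for approximating A @ A.T.

    With norm_based=True, column j is picked with probability
    p_j = ||A_j||^2 / ||A||_F^2 (its share of the total "energy");
    otherwise all columns are equally likely. Each sampled column
    is rescaled by 1/sqrt(L * p_j) so the estimate is unbiased.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    if norm_based:
        p = np.sum(A**2, axis=0)
        p = p / p.sum()              # proportional to squared column norms
    else:
        p = np.full(n, 1.0 / n)      # uniform over columns
    idx = rng.choice(n, size=L, p=p)
    return A[:, idx] / np.sqrt(L * p[idx])

# Sparse-style example: only the first 100 of 2000 columns are nonzero,
# so norm-based sampling never wastes a draw on a zero column.
A = np.zeros((50, 2000))
A[:, :100] = np.random.default_rng(0).normal(size=(50, 100))
C = sample_columns(A, L=500, rng=np.random.default_rng(1))
approx = C @ C.T                     # estimate of A @ A.T
exact = A @ A.T
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```

Note how zero columns get probability 0 under norm-based sampling, which is exactly why it outperforms uniform sampling on sparse data.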
Then, in the same way we measure the energy of these columns, we determine the total error of our approximation by comparing it against the original matrix. Dividing the two (original / approximation) gives some value C. If C is close to 1, we succeeded; if C falls well below 1, our approximation was underdetermined; if C is above 1, it was overdetermined.
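The write-up doesn't give the exact formula for C, so the helper below is one plausible reading, offered as an assumption: the ratio of Frobenius norms, i.e. the same "energy" measure used for sampling, applied to the whole matrix and to its approximation.

```python
import numpy as np

def quality_ratio(A, A_hat):
    """Hypothetical C metric: ratio of total energy (Frobenius norm)
    of the original matrix to that of the approximation.
    Near 1 means the approximation retains roughly all the energy."""
    return np.linalg.norm(A) / np.linalg.norm(A_hat)
```

By this reading, C > 1 means the approximation carries less energy than the original (undershooting), and C < 1 means it carries more (overshooting).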
Results:
With L being our sample size, I found that we could reach 90% accuracy by sampling only 20% of the dataset. Beyond that point, however, any increase in L yielded only small gains in accuracy, suggesting an asymptotic error function.
As you can see, the error at sample sizes below L = 1000 is quite laughable, but as L increases, the error decreases rapidly over the span [0, 5000].
For a more in-depth look, I was able to get the error just under 10% with a dimension size of 10 and a sample size of 20k.
In conclusion, we were able to approximate this 30-billion-entry matrix through our randomized algorithm. Despite using only a fifth of the matrix, we still achieved 90% accuracy, which is quite interesting. For NLP tasks, such as the one this dataset was built for, I think the small loss in accuracy is well worth the large reduction in computational cost. However, for datasets that need more precision, or that are much more than sparse matrices, this may be harder to apply.
To run:
Simply install requirements.txt
You may have to adjust the code depending on how many L values you'd like to test and the type of graph you want
For any further data analysis please reference: docword.nytimes.txt.gz from https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz
The file was too large to include here.

