
# Randomized Matrix Multiplication

Approximating a large sparse matrix through random sampling.

## Background

This began as a research project assigned by my Linear Algebra professor, and initially it was just that. After doing some research on the topic, however, my interest grew as I began recognizing familiar ideas from previous work.

## Intuition

Large matrices are computationally expensive, so when we have a large sparse matrix, such as a dataset of word occurrences, we spend a lot of compute on entries that are just zeros. For this reason, methods have been developed to approximate these large matrices and reduce computation.
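To make the sparsity concrete, here is a minimal sketch (the matrix here is a hypothetical stand-in for a word-occurrence dataset, not the NYTimes data itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a document-by-word occurrence matrix:
# 1,000 documents by 5,000 words, where roughly 1% of entries are
# nonzero counts and the rest are zeros.
counts = rng.integers(1, 10, (1000, 5000))
A = np.where(rng.random((1000, 5000)) < 0.01, counts, 0)

sparsity = 1.0 - np.count_nonzero(A) / A.size
print(f"{sparsity:.1%} of the entries are zero")
```

Any dense operation on such a matrix spends almost all of its work multiplying and adding zeros, which is the waste that sampling tries to avoid.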

## Algorithm (from my understanding)

The algorithm I used to approximate the matrix offers two options: uniform sampling or non-uniform (norm-based) sampling. The overall goal is to recover the expected value (EV) of the matrix from a handful of samples, yielding a smaller, denser matrix. The pitfall of uniform sampling is that it treats every column of the matrix as equally important, even when most of those columns are zero. Norm-based sampling instead weighs a column's importance by how much "energy" it contributes to the total matrix; it's a proportional metric.
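The two sampling schemes can be sketched as follows. This is a minimal illustration of the standard column-sampling estimator for a matrix product, not the repository's exact code; the function name and signature are my own:

```python
import numpy as np

def sampled_product(A, B, c, uniform=False, rng=None):
    """Approximate A @ B by sampling c column/row pairs.

    Uniform sampling picks each index with equal probability;
    norm-based sampling weights index i by ||A[:, i]|| * ||B[i, :]||,
    i.e. by the "energy" that pair contributes to the product.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    if uniform:
        p = np.full(n, 1.0 / n)
    else:
        weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
        p = weights / weights.sum()
    idx = rng.choice(n, size=c, p=p)
    # Rescale each sampled column/row by 1/sqrt(c * p_i) so that the
    # expected value of C @ R equals A @ B (an unbiased estimator).
    scale = np.sqrt(c * p[idx])
    C = A[:, idx] / scale
    R = B[idx, :] / scale[:, None]
    return C @ R
```

On a sparse matrix, the norm-based probabilities are zero for all-zero columns, so the sampler never wastes a draw on them, which is exactly the advantage described above.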

Then, in the same way we measure the energy of these columns, we also measure the total error of our approximation by comparing it to the original matrix. Dividing the two (matrix / approximation) gives some value C. If C is close to 1, we succeeded; if C is below 1 by a large margin, our matrix was underdetermined; if C is above 1, our matrix was overdetermined.
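One way to read "dividing these two together" is as a ratio of Frobenius norms (the same norm used to measure column energy); that assumption, and the function name, are mine:

```python
import numpy as np

def c_ratio(original, approx):
    # Assuming "matrix / approximation" means the ratio of Frobenius
    # norms: C near 1 means success; C well below 1, underdetermined;
    # C above 1, overdetermined (mirroring the description above).
    return np.linalg.norm(original) / np.linalg.norm(approx)
```

For example, an approximation whose entries are uniformly twice too large gives C = 0.5, while one half as large gives C = 2.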

## Results

With L being our sample size, I found that we could reach 90% accuracy by sampling only 20% of the dataset. Beyond that point, any increase in L yielded only small gains in accuracy, suggesting an asymptotic error function.

(figure: approximation error vs. sample size L)

As you can see, sample sizes below L = 1000 perform quite poorly, but as L increases, the error drops rapidly over the range [0, 5000].

For a more in-depth look, I was able to get the error just under 10% with a dimension size of 10 and a sample size of 20k.

(figure: error at dimension size 10 with sample sizes up to 20k)

In conclusion, we were able to approximate this 30-billion-entry matrix with our randomized algorithm. Despite using only a fifth of the matrix, we still achieved 90% accuracy, which is quite interesting. For NLP, which this dataset was used for, I think the small loss in accuracy is well worth the large reduction in computational cost. For datasets that need more precision, or that are more than just sparse matrices, this may be harder to apply.

## To run

Simply install the dependencies: `pip install -r requirements.txt`

You may have to change the code depending on how many L values you'd like to test and the type of graph you want to produce.

For any further data analysis please reference: docword.nytimes.txt.gz from https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz

The file was too large to include here.
