# Non-Negative Matrix Factorization Topic Modeling Algorithm Introduction

### Joseph Jinn and Keith VanderLinden

This Jupyter Notebook file provides a very simple high-level overview of the NMF topic modeling algorithm.  We briefly discuss the plate notation diagram, pseudocode, and statistical formula for the model.

### Graphical Model for the NMF Algorithm:

Note: This diagram is from an implementation of NMF that was custom-designed by a graduate student.  It isn't the Scikit-Learn implementation, but the blog's general description of how Non-negative Matrix Factorization works should apply regardless.

NMF approximates Matrix X (data) as the product of two smaller matrices, W and H.  

No element of any of these matrices can possess negative values (must be non-negative).  

No element of any of these matrices can be missing.

![nmf](../images/nmf_model.png)

Matrix $X$ is an $N \times M$ matrix.

Matrix $W$ is an $N \times K$ matrix.

Matrix $H$ is a $K \times M$ matrix.

$(N \times K)$ $\cdot$ $(K \times M)$ $\approx N \times M$
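The shape rule above can be checked with a few lines of NumPy. This is only a sketch: the sizes (6 words, 4 documents, 2 topics) and the random non-negative entries are made up for illustration and do not come from any real corpus.

```python
import numpy as np

# Hypothetical sizes: N words, M documents, K topics.
N, M, K = 6, 4, 2

rng = np.random.default_rng(0)
W = rng.random((N, K))  # N x K, entries in [0, 1), so non-negative
H = rng.random((K, M))  # K x M, entries in [0, 1), so non-negative

# (N x K) . (K x M) gives an N x M matrix, the same shape as X.
X_approx = W @ H
print(X_approx.shape)
```

Because both factors are non-negative, every entry of the product is non-negative as well.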

$X_{i, j}$ represents the number of times word $i$ appears in document $j$.  Because any single document uses only a small fraction of the vocabulary, most entries are zero, so Matrix $X$ is a sparse matrix.
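As a rough illustration of the count matrix, the sketch below builds one with Scikit-Learn's `CountVectorizer`. The three-sentence corpus is invented for this example. `CountVectorizer` returns a documents-by-words matrix, so we transpose it to get the words-by-documents layout used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock markets fell sharply today",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)  # sparse matrix, documents x words
X = doc_term.T                             # words x documents, as in the text

n_words, n_docs = X.shape
# Fraction of entries that are zero (not stored in the sparse matrix).
sparsity = 1.0 - X.nnz / (n_words * n_docs)
print(f"{n_words} words x {n_docs} documents, {sparsity:.0%} zeros")

# Count of the word "the" in the first document.
the_row = vectorizer.vocabulary_["the"]
print(X[the_row, 0])
```

Even in this tiny corpus most entries are zero, and the effect grows with vocabulary size.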

### Statistical Formula for Calculating the NMF algorithm:

Note: Matrix A = Matrix X (images from multiple sources using different notation).  This section draws on a blog post whose author uses the Scikit-Learn NMF model, so the information should be accurate as to how the statistics/mathematics work.  It is far more layman-friendly than the scholarly article on which Scikit-Learn bases its NMF class.

Non-negative matrix factorization possesses an inherent clustering property that can be used for topic modeling.  According to the blog post "Topic Modeling with LDA and NMF on the ABC News Headlines dataset", factoring Matrix A into Matrix W and Matrix H allows the three matrices involved to represent the following information:

- Matrix $A$: document-word matrix representing the input that specifies what words appear in each document in the corpus.

- Matrix $W$: basis vectors representing the topics (clusters of words) extracted from the documents in the corpus.

- Matrix $H$: coefficient matrix representing the membership weights for each of the topics present in each document in the corpus.

Terms:

Basis vectors: A vector basis of a vector space $V$ is defined as a subset $v_{1},...,v_{n}$ of vectors in $V$ that are linearly independent and span $V$.

Coefficient matrix: A matrix consisting of the coefficients of the variables in a set of linear equations. The matrix is used in solving systems of linear equations.


<img src="../images/nmf_equation.png" width="400px" align="left" alt="nmf equation" >

The above is the objective function that is optimized in order to calculate the values for Matrices W and H.  Matrices W and H are updated iteratively until convergence.

The equation is a measure of the "error of reconstruction between A and the product of its factors W and H, based on Euclidean distance".
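A common way to write a Euclidean reconstruction error is $\frac{1}{2}\lVert A - WH \rVert_F^2$, half the sum of squared differences between $A$ and $WH$. The sketch below assumes that form. The factor of $\frac{1}{2}$ is a convention and may not match the image exactly. The function name and the small matrices are hypothetical.

```python
import numpy as np

def reconstruction_error(A, W, H):
    """Half the squared Frobenius norm of A - WH (assumed form of the objective)."""
    residual = A - W @ H
    return 0.5 * np.sum(residual ** 2)

# Small hand-checkable example.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
W = np.array([[1.0],
              [2.0]])
H = np.array([[1.0, 2.0]])

# W @ H = [[1, 2], [2, 4]], so the only non-zero residual is 3 - 2 = 1.
print(reconstruction_error(A, W, H))  # 0.5 * 1**2 = 0.5
```

An error of zero would mean that $WH$ reproduces $A$ exactly.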

<img src="../images/nmf_equation_2.png" width="200px" align="left" alt="nmf equation 2" >

<img src="../images/nmf_equation_3.png" width="200px" align="left" alt="nmf equation 3" >

The above are the update rules for Matrices W and H that are derived from the objective function.  The values for Matrices W and H are calculated in parallel and then used to re-calculate the error of reconstruction that is given by the objective function.  This process is repeated until convergence is reached.
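The update-and-recompute loop can be sketched in NumPy. This assumes the images show the classic multiplicative update rules for the Euclidean objective, $H \leftarrow H \circ \frac{W^{T}A}{W^{T}WH}$ and $W \leftarrow W \circ \frac{AH^{T}}{WHH^{T}}$, which we cannot confirm from the images alone. The function name, the small `eps` guard against division by zero, and the random test matrix are all hypothetical. The sketch updates $H$ first and then $W$ using the new $H$, the ordering for which the non-increasing-error guarantee is usually stated.

```python
import numpy as np

def nmf_multiplicative(A, n_topics, n_iter=50, eps=1e-10, seed=0):
    """Factor non-negative A (N x M) into W (N x K) and H (K x M)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Random strictly positive start, so no entry begins stuck at zero.
    W = rng.random((n_rows, n_topics)) + eps
    H = rng.random((n_topics, n_cols)) + eps

    errors = []
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep W and H non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        # Re-calculate the error of reconstruction after each update pass.
        errors.append(0.5 * np.sum((A - W @ H) ** 2))
    return W, H, errors

# Made-up non-negative data: 6 "words" x 5 "documents", 2 topics.
A = np.random.default_rng(1).random((6, 5))
W, H, errors = nmf_multiplicative(A, n_topics=2)
print(f"error after first pass: {errors[0]:.4f}, after last pass: {errors[-1]:.4f}")
```

A fixed iteration count stands in for a real convergence check. In practice one would stop once the change in error falls below a tolerance.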

### Pseudocode for the NMF algorithm:

Pseudocode from the scholarly article linked in the Scikit-Learn NMF class documentation.  I won't pretend to understand all the notation and references to the statistics behind the algorithm.

![nmf pseudocode](../images/nmf_pseudocode.png)

## A Simplified Non-Negative Matrix Factorization Topic Modeling Algorithm Example:

Placeholder.

**TODO - implement simple hand-worked example of one iteration through the algorithm (provided we can find an example)** 

### Completion of the FIRST iteration of the NMF algorithm:

Rinse and repeat.

## Resources Referenced:

- https://mlexplained.com/2017/12/28/a-practical-introduction-to-nmf-nonnegative-matrix-factorization/
    - explanation of NMF linear algebra with example code.


- https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df
    - another explanation of NMF mathematics with example code; used a significant portion of this explanation.


- https://towardsdatascience.com/topic-modeling-for-the-new-york-times-news-dataset-1f643e15caac
    - another blog article on utilizing NMF for topic modeling; used the first image.
    
    
- https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
    

- https://en.wikipedia.org/wiki/NP-hardness
    - NMF is an NP-hard problem.
    
    
- https://pdfs.semanticscholar.org/b5d0/36429877568a648389531e323ea0983a5148.pdf
    - scholarly article describing NMF on short texts.
    

- https://www.cc.gatech.edu/~hpark/papers/nmf_book_chapter.pdf
    - another article on NMF topic modeling; used the pseudocode and formulas on page 7.
    
    
- http://mathworld.wolfram.com/VectorBasis.html
    - definition of a basis vector.
  
  
- https://en.wikipedia.org/wiki/Coefficient_matrix
    - definition of a coefficient matrix.
    
    