# Intro

Text classification can also be approached with Non-Negative Matrix
Factorization (NMF). Generally, NMF is used to reduce the dimension of
feature matrices, for source separation, or for topic extraction. In
NLP, it is widely used for topic modelling: the *BoW* or *Tfidf* matrix
is decomposed into two smaller matrices that associate *Documents* and
*Words* with *Topics*. Therefore, from an original
$Documents \times Words$ matrix, a $Documents \times Topics$ matrix and
a $Topics \times Words$ matrix are generated. From this separation, just
as with LDA, an association between documents and topics, as well as
between words and topics, is made.

![Non Negative Matrix Factorization process to reach Topic Encoded
matrices. $A$ is the BoW/TFIDF matrix which can be decomposed into $W$
and $H$. These show the relationship of documents and words with topics,
respectively.](../Figures/NMF.png)
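
To make the shapes in the figure concrete, here is a minimal sketch on a small, made-up matrix. The values of `A` are hypothetical and only serve to show that `W` has one row per document and `H` has one column per word. The `init` and `max_iter` settings are illustrative choices, not the ones used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "documents x words" matrix: 4 documents, 6 words.
A = np.array([
    [2, 1, 0, 0, 0, 1],
    [3, 2, 0, 0, 1, 0],
    [0, 0, 4, 3, 0, 0],
    [0, 1, 3, 2, 0, 0],
], dtype=float)

# Factorize into 2 topics.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(A)   # documents x topics
H = model.components_        # topics x words

print("W shape:", W.shape)
print("H shape:", H.shape)
print("Reconstruction W @ H:")
print(np.round(W @ H, 1))
```

Each row of `W` says how strongly a document expresses each topic, and each row of `H` says how strongly each word belongs to a topic.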

NMF finds two non-negative matrices *(W, H)* whose product approximates
the non-negative matrix *A*. They are obtained by minimizing the
following objective function, the squared Frobenius norm of the
reconstruction error:

$$\|A-WH\|_F^2$$

One way to solve this minimization problem is a multiplicative-update
solver, which alternates between updating W and updating H.
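
As an illustration of the alternating scheme, the sketch below implements the classic multiplicative update rules for the squared Frobenius objective in plain NumPy. It is a simplified sketch, not the solver scikit-learn runs internally. The function name `nmf_multiplicative`, the random initialization, and the small `eps` added to avoid division by zero are all assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(A, n_components, n_iter=200, eps=1e-10, seed=0):
    """Approximate A (non-negative) by W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Random non-negative starting point (an assumption of this sketch).
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    errors = []
    for _ in range(n_iter):
        # Update H with W fixed, then W with H fixed.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        # Track the objective ||A - WH||_F^2 after each full update.
        errors.append(np.linalg.norm(A - W @ H) ** 2)
    return W, H, errors

# Hypothetical random non-negative matrix, used only to exercise the function.
A_demo = np.random.default_rng(1).random((8, 5))
W_demo, H_demo, errors_demo = nmf_multiplicative(A_demo, n_components=2)
print("first error:", errors_demo[0])
print("last error:", errors_demo[-1])
```

Because every update multiplies by a non-negative ratio, `W` and `H` stay non-negative throughout without any explicit clipping.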

In [2]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

In [28]:
# Load three categories of the 20 Newsgroups training set
categories = ['alt.atheism', 'soc.religion.christian', "comp.graphics"]

train_dataset = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
train_dataset.target_names

['alt.atheism', 'comp.graphics', 'soc.religion.christian']

In [29]:
n_samples = 2000
n_features = 1000   # vocabulary size kept by the vectorizer
n_components = 3    # one topic per newsgroup category
n_top_words = 20

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vect = TfidfVectorizer(max_df=0.95, min_df=2,
                             max_features=n_features,
                             stop_words='english')

tfidf = tfidf_vect.fit_transform(train_dataset.data)

Extracting tf-idf features for NMF...


In [30]:
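# Note: `alpha` was deprecated in scikit-learn 1.0 and removed in 1.2.
# Newer versions use `alpha_W` / `alpha_H`, which are scaled differently,
# so the value below is not directly interchangeable.
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)

# Assign each document to its highest-weight topic. Topic indices are
# arbitrary and need not match the label indices of the dataset.
topic_values = nmf.transform(tfidf)
pred = topic_values.argmax(axis=1)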



In [31]:
from sklearn.metrics import classification_report

# Compare the topic assigned to each document with its true category
y_train = train_dataset.target
report = classification_report(y_train, pred)

print(report)

              precision    recall  f1-score   support

           0       0.33      0.58      0.42       480
           1       0.89      0.98      0.93       584
           2       0.00      0.00      0.00       599

    accuracy                           0.51      1663
   macro avg       0.40      0.52      0.45      1663
weighted avg       0.41      0.51      0.45      1663
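
One possible reason for the zero scores on class 2 is that NMF topic indices are arbitrary: topic `0` need not correspond to label `0`. A common workaround is to map topics to labels before scoring, for instance with the Hungarian algorithm applied to the confusion matrix. The sketch below shows the idea on hypothetical arrays. The helper `align_topics_to_labels` is an assumption of this example, and because it uses the true labels to choose the mapping, the resulting scores are an optimistic estimate.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def align_topics_to_labels(y_true, topic_pred, n_classes):
    """Relabel topic indices so that they best match the true labels."""
    labels = np.arange(n_classes)
    # Rows: true labels, columns: predicted topic indices.
    cm = confusion_matrix(y_true, topic_pred, labels=labels)
    # One-to-one assignment of labels to topics that maximizes agreement.
    label_idx, topic_idx = linear_sum_assignment(cm, maximize=True)
    mapping = {int(t): int(l) for l, t in zip(label_idx, topic_idx)}
    return np.array([mapping[int(t)] for t in topic_pred])

# Hypothetical labels and topic assignments with permuted indices.
y_demo = np.array([0, 0, 1, 1, 2, 2])
topics_demo = np.array([2, 2, 0, 0, 1, 1])

aligned_demo = align_topics_to_labels(y_demo, topics_demo, n_classes=3)
print(aligned_demo)
```

On the notebook's variables this would be `align_topics_to_labels(y_train, pred, n_classes=n_components)`, followed by `classification_report(y_train, aligned_pred)` on the returned array.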



## References

- https://www.audiolabs-erlangen.de/resources/MIR/FMP/C8/C8S3_NMFbasic.html