# Non-negative Matrix Factorization (NMF)

The key idea behind NMF is to approximate a given non-negative matrix $ V $ by the product of two smaller non-negative matrices, $ V \approx WH $, that capture the underlying structure or latent features of the data.

$ W $ is the matrix representing the basis (or components);

$ H $ is the matrix representing the coefficients (or activations) of these components.


Since both $ W $ and $ H $ are non-negative, the resulting components or topics tend to be more interpretable: topics are formed by purely additive combinations of terms, with no negative weights to cancel each other out.
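To make the factorization concrete, here is a minimal NumPy sketch that factorizes a small non-negative matrix with the classic multiplicative update rules. It is an illustration only: the function name `nmf_multiplicative`, the toy matrix, and the iteration count are assumptions made for this example, and scikit-learn's `NMF` uses a different solver by default.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=500, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (m x n) into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Start from random non-negative factors
    W = rng.random((m, n_components))
    H = rng.random((n_components, n))
    for _ in range(n_iter):
        # Multiplicative updates only rescale entries by non-negative ratios,
        # so W and H stay non-negative; eps guards against division by zero
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix with two obvious "topics": columns 0-1 and columns 2-3
V = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 6.0, 2.0],
])

W, H = nmf_multiplicative(V, n_components=2)
error = np.linalg.norm(V - W @ H)  # Frobenius norm of the residual
print(np.round(W @ H, 2))
print(error)
```

The product `W @ H` should come close to `V`, while every entry of `W` and `H` stays non-negative.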

## 0. Setting up the environment

Load a language model for preprocessing text in Russian

In [1]:
!python -m spacy download ru_core_news_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('ru_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import spacy

nlp = spacy.load("ru_core_news_sm", disable=["ner", "parser"])

A function to produce a lemmatized version of the input text

In [3]:
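def preprocessor(text):
  """Return the space-joined lemmas of the alphabetic, non-stop-word tokens."""
  doc = nlp(text)
  # Keep alphabetic tokens that are not stop words, reduced to their lemmas
  lemmas = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
  return ' '.join(lemmas)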

## 1. The data

In [4]:
import pandas as pd

Read the dataset

In [8]:
df = pd.read_csv('/content/drive/MyDrive/2024/компьютерная лингвистика/unsupervised learning/sports_articles.tsv', sep='\t')
df

| | headline | summary |
|---|---|---|
| 0 | Наказание за самоуверенность: Малыхин проиграл... | Анатолий Малыхин в Бангкоке потерпел сенсацион... |
| 1 | «Находится на низком уровне»: Васильев — о ско... | Если Камила Валиева по своим физическим и мент... |
| 2 | Победа «Спартака» в дерби, долги «Лады» по зар... | «Спартак» в третий раз обыграл ЦСКА в нынешнем... |
| 3 | «Верить ему нельзя»: глава World Athletics Коу... | Президент World Athletics Себастьян Коу сообщи... |
| 4 | Затмил дуэль Кучерова и Федотова: Капризов офо... | Кирилл Капризов трижды ассистировал партнёрам ... |
| ... | ... | ... |
| 904 | Короли разделки: Большунов одержал 20-ю победу... | Александр Большунов завоевал третью золотую ме... |
| 905 | Тримуф «Канзас-Сити» в овертайме, магия Махоум... | «Канзас-Сити» взял верх над «Сан-Франциско» в ... |
| 906 | Хет-трик Афифа с пенальти, подачи Адингры и ка... | Катар и Кот-д'Ивуар одержали победы в финалах ... |
| 907 | «Она просто быстрее и мощнее»: Касаткина уступ... | Дарья Касаткина не завоевала титул на турнире ... |


## 2. Vectorization

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

Extract the article summaries as a list of documents

In [11]:
documents = df['summary'].to_list()

Build the Document-Term Matrix

In [12]:
vectorizer = TfidfVectorizer(preprocessor=preprocessor)
X = vectorizer.fit_transform(documents)
X

<909x8434 sparse matrix of type '<class 'numpy.float64'>'
	with 45579 stored elements in Compressed Sparse Row format>

In [13]:
dtm = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
dtm

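| | abc | aca | add | amc | ap | aquatics | ard | asia | athletic | athletics | ... | ясмин | ясмина | ясмину | ясюн | ятимова | ятимову | яхта | яхтенный | яхью | яшкин |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.197143 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 904 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 905 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 906 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 907 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |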


## 3. Dimensionality reduction techniques

In [14]:
from sklearn.decomposition import NMF

Fit **Non-negative Matrix Factorization** with 15 components, one per topic:


In [16]:
nmf = NMF(n_components=15, random_state=42)

X_nmf = nmf.fit_transform(dtm)
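It can help to see what the two outputs of this step contain. The sketch below fits `NMF` on a tiny hand-made document-term matrix; the toy data, the `toy_`-prefixed names, and the explicit `init` and `max_iter` settings are assumptions for illustration and are not the notebook's configuration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix: rows are documents, columns are terms.
# Documents 0-1 use only terms 0-1; documents 2-3 use only terms 2-3.
toy_dtm = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [6.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 5.0],
    [0.0, 0.0, 2.0, 10.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=42)

# Document-topic weights: one row per document, one column per topic
doc_topic = toy_nmf.fit_transform(toy_dtm)

# Topic-term weights: one row per topic, one column per term
topic_term = toy_nmf.components_

# Index of the strongest topic for each document
dominant_topic = doc_topic.argmax(axis=1)

print(doc_topic.shape, topic_term.shape)
print(dominant_topic)
print(toy_nmf.reconstruction_err_)
```

By the same logic, `X_nmf` in the notebook has one row per article and 15 columns, while `nmf.components_` has 15 rows and one column per vocabulary term.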

## 4. Topics

Display the top *n* words for each topic


In [17]:
top_n = 10

terms = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(nmf.components_):
  # Indices of the top_n highest-weighted terms, in descending order of weight
  top_indices = topic.argsort()[::-1][:top_n]
  print(f"Topic #{topic_idx + 1}:")
  print(" ".join(terms[i] for i in top_indices))
  print()

Topic #1:
тайм тур футбол мяч удар перерыв рпл подопечные минута соперник

Topic #2:
париж игра мок спортсмен участие олимпийский федерация соревнование президент решение

Topic #3:
передача очко нхл никита кучеров регулярный результативный панарин артемий кирилл

Topic #4:
сет ракетка андреев турнир брейк партия гейм медведев мирра даниил

Topic #5:
гонка опередить секунда золото бронза завоевать анастасия женщина последний шевченко

Topic #6:
бобровский флорида сергей голкипер отразить стэнли бросок серия тарасенко признать

Topic #7:
вес поединок бой титул ufc турнир раунд судья чемпион первый

Topic #8:
катание фигурный программа александр алина загитов елизавета балл евгений обсуждать

Topic #9:
контракт клуб команда год подписать сборная футболист тренер главный спартак

Topic #10:
гагарин металлург серия локомотив кхл автомобилист авангард период ска трактор

Topic #11:
рпл оренбург урал цска ахмат рубин факел нн динамо пути

Topic #12:
интервью рассказать rt объяснить слово зая