# Text preprocessing to reduce the dimensionality of the document-term matrix (DTM)

Imports

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

Consider two short documents in Russian (roughly, "I love my dog." and "My dog loves me."):

In [25]:
documents = [
    "Я люблю свою собаку.",
    "своя собака меня любит."
]

When we construct the DTM, each inflected word form is treated as a separate term.

In [35]:
vectorizer = TfidfVectorizer()

dtm = vectorizer.fit_transform(documents)

dtm

<2x7 sparse matrix of type '<class 'numpy.float64'>'
	with 7 stored elements in Compressed Sparse Row format>

Let's check the unique word forms identified by the vectorizer

In [34]:
vectorizer.get_feature_names_out()

array(['любит', 'люблю', 'меня', 'свою', 'своя', 'собака', 'собаку'],
      dtype=object)
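
To see where these seven terms come from, here is a minimal standard-library sketch of the tokenization step. It assumes the vectorizer's default settings (lowercasing and the default token pattern, which keeps only tokens of two or more word characters) and builds raw counts rather than TF-IDF weights; the helper name `tokenize` is made up for this illustration.

```python
import re

documents = [
    "Я люблю свою собаку.",
    "своя собака меня любит.",
]

# Same regular expression as TfidfVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(doc) for doc in documents]

# The vocabulary is the sorted set of distinct surface forms.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Raw count matrix: one row per document, one column per vocabulary term.
counts = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

# Non-zero cells, comparable to the "stored elements" of the sparse matrix.
nonzero = sum(1 for row in counts for value in row if value)

print(vocabulary)  # seven distinct surface forms
print(counts)
print(nonzero)     # 7
```

The single-letter pronoun "Я" never reaches the vocabulary because the pattern requires at least two characters, and no term is shared between the two documents because every word appears in a different inflected form.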

We can reduce the number of terms (the size of the vocabulary) by performing text preprocessing with the help of spaCy.

First, we need to download a spaCy language model for Russian.

In [27]:
!python -m spacy download ru_core_news_sm -q

import spacy

nlp = spacy.load("ru_core_news_sm")

✔ Download and installation successful
You can now load the package via spacy.load('ru_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Let's define a text preprocessing function using spaCy

In [28]:
def spacy_preprocessor(text):
    """Return the lemmas of `text`, minus stop words and punctuation, joined by spaces."""
    doc = nlp(text)
    return " ".join(
        token.lemma_              # use the lemmatized form
        for token in doc
        if not token.is_stop      # exclude stop words
        and not token.is_punct    # exclude punctuation
    )
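
The three steps of the function (drop punctuation, drop stop words, replace each word with its lemma) can be illustrated without loading a model. The sketch below is a toy stand-in, not spaCy: `TOY_STOP_WORDS`, `TOY_LEMMAS`, and `toy_preprocessor` are hypothetical names, and the tables are hand-written to mirror the result shown in this notebook rather than taken from the `ru_core_news_sm` model.

```python
import re

documents = [
    "Я люблю свою собаку.",
    "своя собака меня любит.",
]

# Hypothetical hand-written tables, for illustration only.
# They are NOT spaCy's actual stop-word list or lemmatizer data.
TOY_STOP_WORDS = {"я", "меня", "свою", "своя"}
TOY_LEMMAS = {
    "люблю": "любить",
    "любит": "любить",
    "собаку": "собака",
    "собака": "собака",
}


def toy_preprocessor(text):
    """Drop punctuation and stop words, then map each word to its lemma."""
    words = re.findall(r"\w+", text.lower())  # punctuation is never matched
    return " ".join(
        TOY_LEMMAS.get(word, word)            # use the lemmatized form
        for word in words
        if word not in TOY_STOP_WORDS         # exclude stop words
    )


processed = [toy_preprocessor(doc) for doc in documents]
vocabulary = sorted({word for text in processed for word in text.split()})

print(processed)   # ['любить собака', 'собака любить']
print(vocabulary)  # ['любить', 'собака']
```

Once the inflected forms collapse onto shared lemmas, both documents contain the same two terms, which is what makes the resulting matrix smaller and denser.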

Now we can pass the preprocessing function as the `preprocessor` argument when creating the vectorizer.

In [31]:
custom_vectorizer = TfidfVectorizer(preprocessor=spacy_preprocessor)

new_dtm = custom_vectorizer.fit_transform(documents)

new_dtm

<2x2 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

Let's check the terms identified by the custom vectorizer; they are now lemmas rather than inflected word forms

In [33]:
custom_vectorizer.get_feature_names_out()

array(['любить', 'собака'], dtype=object)

## The preprocessing function reduced the vocabulary from 7 terms to 2