Mel lain docs (bigartm#731)
* Docs refactoring: add loading data page in python userguide

* Docs refactoring: add other sections + fixes
MelLain authored and JeanPaulShapo committed Feb 13, 2017
1 parent 47a13d4 commit c6cf714
Showing 10 changed files with 538 additions and 4 deletions.
69 changes: 69 additions & 0 deletions docs/tutorials/python_userguide/attach_model.txt
@@ -1,2 +1,71 @@
7. Attach Model and Custom Phi Initialization
=============================================

Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`.

The library supports the ability to access all :math:`\Phi`-like matrices directly from Python. This is low-level functionality, so it wasn't included in the ``ARTM`` class and is available via the low-level ``master_component`` interface. The user can attach the matrix, i.e. get a reference to it in Python, and change its content between iterations. The changes will be written to the native C++ memory.

The most obvious use case for this feature is custom initialization of the :math:`\Phi` matrix. The library initializes it with random numbers by default. But there are several more complex and useful initialization methods that the library doesn't support yet. In this case the ``attach_model`` method can help you.

So let's attach to the :math:`\Phi` matrix of our model:

.. code-block:: python

(_, phi_ref) = model.master.attach_model(model=model.model_pwt)

At this moment you can print the :math:`\Phi` matrix to see its content:

.. code-block:: python

model.get_phi(model_name=model.model_pwt)

The following code can be used to check whether the attaching was successful:

.. code-block:: python

for model_description in model.info.model:
    print model_description

The output will be similar to the following:

.. code-block:: none



name: "nwt"

type: "class artm::core::DensePhiMatrix"

num_topics: 50

num_tokens: 2500



name: "pwt"

type: "class __artm::core::AttachedPhiMatrix__"

num_topics: 50

num_tokens: 2500



You can see that the type of the :math:`\Phi` matrix has changed from ``DensePhiMatrix`` to ``AttachedPhiMatrix``.

Now let's assume that you have created a ``pwt_new`` matrix of the same size, filled with your custom values. Let's write these values into our :math:`\Phi` matrix.

.. note::

You need to write the values by accessing the ``phi_ref`` variable; you are not allowed to assign the whole ``pwt_new`` matrix to it, as this operation will lead to errors in further work.

.. code-block:: python

for tok in xrange(num_tokens):
    for top in xrange(num_topics):
        phi_ref[tok, top] = pwt_new[tok, top]  # CORRECT!

phi_ref = pwt_new  # NO!
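
As an illustration, here is a minimal sketch of one possible custom initialization: column-normalized random values built with numpy. It assumes that ``phi_ref`` behaves as a numpy array (so ``shape`` is available); the uniform random scheme is just an example, not a library recommendation:

.. code-block:: python

import numpy as np

num_tokens, num_topics = phi_ref.shape  # sizes of the attached matrix
pwt_new = np.random.rand(num_tokens, num_topics)
pwt_new /= pwt_new.sum(axis=0)  # normalize: each topic column sums to 1

for tok in xrange(num_tokens):
    for top in xrange(num_topics):
        phi_ref[tok, top] = pwt_new[tok, top]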

After that you can print the :math:`\Phi` matrix again and check that its values have changed. From this moment you can continue your work.
87 changes: 87 additions & 0 deletions docs/tutorials/python_userguide/base_plsa.txt
@@ -1,2 +1,89 @@
2. Base PLSA Model with Perplexity Score
========================================

Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`.

At this moment you need to have the following objects:

- a directory named ``my_collection_batches``, containing batches and the dictionary in the binary file ``my_dictionary.dict``; the directory should be located in the same place as your code file;
- ``Dictionary`` variable ``my_dictionary``, containing this dictionary (gathered or loaded);
- ``BatchVectorizer`` variable ``batch_vectorizer`` (the same one we created earlier).

If everything is OK, let's start creating the model. First you need to read the specification of the ``ARTM`` class, which represents the model. Then you can use the following code to create the model:

.. code-block:: python

model = artm.ARTM(num_topics=20, dictionary=my_dictionary)

Now you have created a model containing a :math:`\Phi` matrix of size "number of words in your dictionary" :math:`\times` "number of topics" (20). This matrix was randomly initialized. Note that by default the random seed for initialization is fixed, to achieve the ability to re-run experiments and get the same results. If you want different random start values, use the ``seed`` parameter of the ``ARTM`` class (different non-negative integer values lead to different initializations).
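
For example, a minimal sketch with an explicit seed (the value ``135`` is arbitrary and used here only for illustration):

.. code-block:: python

model = artm.ARTM(num_topics=20, dictionary=my_dictionary, seed=135)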

From this moment we can start learning the model. But typically it is useful to enable some scores for monitoring the quality of the model. Let’s use the perplexity now.

You can deal with scores using the ``scores`` field of the ``ARTM`` class. The perplexity score can be added in the following way:

.. code-block:: python

model.scores.add(artm.PerplexityScore(name='my_first_perplexity_score',
                                      dictionary=my_dictionary))

Note that perplexity should be enabled exactly in the described way (though you can change the other parameters that we didn't use here). You can read about it in :doc:`../scores_descr`.

.. note::

If you try to create the second score with the same name, the ``add()`` call will be ignored.

Now let's start the main act, i.e. the learning of the model. We can do that in two ways: using the online algorithm or the offline one. The corresponding methods are ``fit_online()`` and ``fit_offline()``. It is assumed that you know the features of these algorithms, but I will briefly remind you of them:

- **Offline algorithm**: many passes through the collection, one pass through each document (optional), only one update of the :math:`\Phi` matrix per collection pass (at the end of the pass). You should use this algorithm while processing small collections.

- **Online algorithm**: a single pass through the collection (optional), many passes through each document, several updates of the :math:`\Phi` matrix during one pass through the collection. Use this one when you deal with large collections, or with collections whose topics change quickly.

We will use offline learning here and in all further examples on this page (because correct usage of the online algorithm requires more skill).

Well, let's start training:

.. code-block:: python

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)

This code chunk works slower than any previous one. Now that we have performed the first step of learning, it will be useful to look at the perplexity. We need to use the ``score_tracker`` field of the ``ARTM`` class for this. It remembers all values of all scores on each :math:`\Phi` matrix update. These data can be retrieved using the names of the scores.

You can extract only the last value:

.. code-block:: python

print model.score_tracker['my_first_perplexity_score'].last_value

Or you can extract the list of all values:

.. code-block:: python

print model.score_tracker['my_first_perplexity_score'].value
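
If you want to inspect the dynamics visually, here is a minimal sketch that plots the tracked values; using matplotlib is an assumption of this example, not a BigARTM requirement:

.. code-block:: python

import matplotlib.pyplot as plt

values = model.score_tracker['my_first_perplexity_score'].value
plt.plot(range(len(values)), values)  # one point per Phi matrix update
plt.xlabel('Phi matrix update')
plt.ylabel('Perplexity')
plt.show()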

If the perplexity has converged, you can finish the learning process. Otherwise you need to continue. As noted above, the number of passes over a single document is not fixed. Both the ``fit_offline()`` and ``fit_online()`` methods support any number of document passes you want to have. To change this number you need to modify the corresponding parameter of the model:

.. code-block:: python

model.num_document_passes = 5

All subsequent calls of the learning methods will use this value. Let's continue fitting:

.. code-block:: python

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)

We continued learning the previous model by making 15 more collection passes with 5 document passes.
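
If you prefer to automate the stopping decision, here is a minimal sketch of a convergence loop built from the calls shown above; the relative threshold of 0.001 is an arbitrary choice for this example:

.. code-block:: python

threshold = 0.001  # relative perplexity change at which we stop
previous = None
while True:
    model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=1)
    current = model.score_tracker['my_first_perplexity_score'].last_value
    if previous is not None and abs(previous - current) / previous < threshold:
        break
    previous = current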

You can continue to work with this model in the described way. One more note: if at some moment you realize that your model has degenerated and you don't want to create a new one, use the ``initialize()`` method. It will refill the :math:`\Phi` matrix with random numbers without changing anything else (neither your tuning of the regularizers/scores, nor the history from ``score_tracker``):

.. code-block:: python

model.initialize(dictionary=my_dictionary)

FYI, this method is called in the ``ARTM`` constructor if you pass it the dictionary parameter. Note that a change of the ``seed`` field will affect the next call of ``initialize()``.

Also note that you can pass the name of the dictionary instead of the dictionary object wherever a dictionary is used:

.. code-block:: python

model.initialize(dictionary=my_dictionary.name)
41 changes: 41 additions & 0 deletions docs/tutorials/python_userguide/different.txt
@@ -0,0 +1,41 @@
Different Useful Techniques
===========================

* **Dictionary filtering**:

In this section we'll discuss the dictionary's self-filtering ability. Let's recall the structure of the dictionary saved in textual format (see :doc:`m_artm`). It contains many lines, one per unique token, and each line contains 5 values: the token (string), its class_id (string), its value (double) and two more integer parameters, called token_tf and token_df. token_tf is the absolute frequency of the token in the whole collection, and token_df is the number of documents in the collection where the token appeared at least once. These values are generated by the library while gathering the dictionary. They differ from value in that you can't use them in regularizers and scores, so you shouldn't change them.

They are needed for filtering the dictionary. You likely don't need very rare or overly frequent tokens in your model. Or you may simply want to reduce your dictionary to fit your model in memory. In both cases the solution is to use the ``Dictionary.filter()`` method. See its parameters in :doc:`../../api_references/python_interface`. Now let's filter the modality of usual tokens:

.. code-block:: python

dictionary.filter(min_tf=10, max_tf=2000, min_df_rate=0.01)

.. note::
If a parameter has the ``_rate`` suffix, it denotes a relative value (i.e. from 0 to 1); otherwise it denotes an absolute value.

This call has one feature: it overwrites the old dictionary with the new one. So if you don't want to lose your full dictionary, you first need to save it to disk, and then filter the copy located in memory.
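
A minimal sketch of this save-then-filter workflow, reusing the ``save()`` call from the loading data section (the path is just an example):

.. code-block:: python

# keep the full dictionary on disk before filtering the in-memory copy
dictionary.save(dictionary_path='my_collection_batches/full_dictionary')
dictionary.filter(min_tf=10, max_tf=2000, min_df_rate=0.01)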

* **Saving/loading model**:

Now let's study saving the model to disk.

It's important to understand that the model contains two matrices: :math:`\Phi` (or :math:`p_{wt}`) and :math:`n_{wt}`. To make the model loadable without losses you need to save both of these matrices. The current library version can save only one matrix per method call, so you will need two calls:

.. code-block:: python

model.save(filename='saved_p_wt', model_name='p_wt')
model.save(filename='saved_n_wt', model_name='n_wt')

The model will be saved in binary format. To use it later you need to load its matrices back:

.. code-block:: python

model.load(filename='saved_p_wt', model_name='p_wt')
model.load(filename='saved_n_wt', model_name='n_wt')

.. note::

After loading, the model will only contain the :math:`\Phi` and :math:`n_{wt}` matrices and some associated information (like the number of topics, their names, the names of the modalities (without weights!) and some other data). So you need to restore all necessary scores, regularizers, modality weights and all important parameters, like ``cache_theta``.

You can use the ``save``/``load`` pair of methods in case of long fitting, when restoring the parameters is much easier than re-fitting the model.
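
A minimal sketch of such a restore-after-load workflow; the score name and the parameters being restored are assumptions carried over from the earlier examples:

.. code-block:: python

model = artm.ARTM(num_topics=20)  # should match the saved model
model.load(filename='saved_p_wt', model_name='p_wt')
model.load(filename='saved_n_wt', model_name='n_wt')

# re-attach everything that is not stored with the matrices
model.cache_theta = True
model.scores.add(artm.PerplexityScore(name='my_first_perplexity_score',
                                      dictionary=my_dictionary))
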
2 changes: 2 additions & 0 deletions docs/tutorials/python_userguide/index.txt
@@ -18,5 +18,7 @@ Python Guide
phi_theta_extraction
coherence
attach_model
ptdw
different

.. vim:ft=rst
101 changes: 101 additions & 0 deletions docs/tutorials/python_userguide/loading_data.txt
@@ -1,2 +1,103 @@
1. Loading Data: BatchVectorizer and Dictionary
===============================================

Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`.

* **BatchVectorizer**:

Before starting modeling we need to convert your data into the library format. At first you need to read about the supported source data formats in :doc:`../datasets`. It's your task to prepare your data in one of these formats. Once you have transformed your data into one of the source formats, you can convert it into the BigARTM internal format (batches of documents) using a ``BatchVectorizer`` class object.

Actually, there is one more simple way to process your collection, if it is not too big and you don't need to store it in batches. To use it you need to obtain two variables: a ``numpy.ndarray`` ``n_wd`` with :math:`n_{wd}` counters and a corresponding Python dict with the vocabulary (key - index in the ``numpy.ndarray``, value - corresponding token). The simplest way to get these data is to use the sklearn ``CountVectorizer`` (or some similar sklearn class).

Once you have obtained the described variables, run the following code:

.. code-block:: python

batch_vectorizer = artm.BatchVectorizer(data_format='bow_n_wd',
                                        n_wd=n_wd,
                                        vocabulary=vocabulary)
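
Here is a minimal sketch of obtaining ``n_wd`` and ``vocabulary`` with sklearn; it assumes ``n_wd`` is a word-by-document matrix of counters, as the :math:`n_{wd}` notation suggests:

.. code-block:: python

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ['some text to process', 'another text to process']  # toy collection
cv = CountVectorizer()
bow = cv.fit_transform(texts)  # documents x words sparse matrix

n_wd = np.array(bow.todense()).T  # words x documents counters
vocabulary = dict(enumerate(cv.get_feature_names()))  # index -> token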

Well, if you have data in UCI format (e.g. the ``vocab.my_collection.txt`` and ``docword.my_collection.txt`` files), placed in the same directory as your script or notebook, you can create batches using the following code:

.. code-block:: python

batch_vectorizer = artm.BatchVectorizer(data_path='',
                                        data_format='bow_uci',
                                        collection_name='my_collection',
                                        target_folder='my_collection_batches')

The built-in library parser converts your data into batches and wraps them in a ``BatchVectorizer`` class object, which is the general input data type for all methods of the Python API. The batches are placed in the directory you specified in the ``target_folder`` parameter.

If you have the source file in the Vowpal Wabbit data format, you can use the following command:

.. code-block:: python

batch_vectorizer = artm.BatchVectorizer(data_path='',
                                        data_format='vowpal_wabbit',
                                        target_folder='my_collection_batches')

The result is exactly the same as described above.

.. note::

If you have created batches once, you don't need to launch this process again, because it takes a lot of time on a large collection. You can run the following code instead. It will create the ``BatchVectorizer`` object using the existing batches (this operation is very quick):

.. code-block:: python

batch_vectorizer = artm.BatchVectorizer(data_path='my_collection_batches',
                                        data_format='batches')

* **Dictionary**:

The next step is to create a ``Dictionary``. This is a data structure containing information about all unique tokens in the collection. The dictionary is generated outside of the model, and this operation can be done in different ways (load, create, gather). The most basic case is to gather the dictionary from the batches directory. You need to perform this operation only once, when you start working with a new collection. Use the following code:

.. code-block:: python

dictionary = artm.Dictionary()
dictionary.gather(data_path='my_collection_batches')

In this case the token order in the dictionary (and in the future :math:`\Phi` matrix) will be random. If you'd like to specify some order, you need to create a vocab file (see UCI format) containing all unique tokens of the collection in the necessary order, and run the code below (assuming your file is named ``vocab.txt`` and is located in the same directory as your code):

.. code-block:: python

dictionary = artm.Dictionary()
dictionary.gather(data_path='my_collection_batches',
                  vocab_file_path='vocab.txt')

Take into consideration that if you used a vocab file, the library will ignore any token from the batches that is not present in it. The ``Dictionary`` contains a lot of useful information about the collection. For example, each unique token in it has a corresponding variable - value. When BigARTM gathers the dictionary, it puts the relative frequency of the token into this variable. You can read about the use-cases of this variable in further sections.

Well, now you have a dictionary. It can be saved to disk to avoid re-creating it. You can save it in binary format:

.. code-block:: python

dictionary.save(dictionary_path='my_collection_batches/my_dictionary')

Or in the textual one (if you'd like to see the gathered data, for example):

.. code-block:: python

dictionary.save_text(dictionary_path='my_collection_batches/my_dictionary.txt')

A saved dictionary can be loaded back. The code for a binary file looks like this:

.. code-block:: python

dictionary.load(dictionary_path='my_collection_batches/my_dictionary.dict')

For a textual dictionary you can run the following code:

.. code-block:: python

dictionary.load_text(dictionary_path='my_collection_batches/my_dictionary.txt')

Besides viewing the content of the textual dictionary, you can also edit it (for example, change the content of the value field). After you load the dictionary back, these changes will take effect.

.. note::

All described ways of generating batches automatically generate a dictionary. You can access it by typing:

.. code-block:: python

batch_vectorizer.dictionary

If you don't want this dictionary to be created, set the ``gather_dictionary`` parameter of the ``BatchVectorizer`` constructor to ``False``. But this flag will be ignored if ``data_format`` == ``bow_n_wd``, as this is the only possible way to generate the dictionary in this case.