From c6cf714038eefeb5438042f3f492771e4efc91a9 Mon Sep 17 00:00:00 2001 From: Murat Apishev Date: Mon, 23 Jan 2017 17:40:56 +0300 Subject: [PATCH] Mel lain docs (#731) * Docs refactoring: add loading data page in python userguide * Docs refactoring: add other sections + fixes --- .../python_userguide/attach_model.txt | 69 +++++++++ docs/tutorials/python_userguide/base_plsa.txt | 87 ++++++++++++ docs/tutorials/python_userguide/different.txt | 41 ++++++ docs/tutorials/python_userguide/index.txt | 2 + .../python_userguide/loading_data.txt | 101 ++++++++++++++ docs/tutorials/python_userguide/m_artm.txt | 54 +++++++ .../python_userguide/phi_theta_extraction.txt | 46 ++++++ docs/tutorials/python_userguide/ptdw.txt | 2 + .../regularizers_and_scores.txt | 132 ++++++++++++++++++ docs/tutorials/scores_descr.txt | 8 +- 10 files changed, 538 insertions(+), 4 deletions(-) create mode 100644 docs/tutorials/python_userguide/different.txt create mode 100644 docs/tutorials/python_userguide/ptdw.txt diff --git a/docs/tutorials/python_userguide/attach_model.txt b/docs/tutorials/python_userguide/attach_model.txt index 6482d4a80..17925bcf2 100644 --- a/docs/tutorials/python_userguide/attach_model.txt +++ b/docs/tutorials/python_userguide/attach_model.txt @@ -1,2 +1,71 @@ 7. Attach Model and Custom Phi Initialization ======= + +Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`. + +The library supports the ability to access all :math:`\Phi`-like matrices directly from Python. This is low-level functionality, so it wasn't included in the ``ARTM`` class and is available via the low-level ``master_component`` interface. The user can attach the matrix, i.e. get a reference to it in Python, and change its content between the iterations. The changes will be written to the native C++ memory. + +The most obvious use case of this feature is a custom initialization of the :math:`\Phi` matrix. The library initializes it with random numbers by default, but there are several more complex and useful initialization methods that the library doesn't support yet. In this case the ``attach_model`` method can help you. + +So let's attach to the :math:`\Phi` matrix of our model: + +.. code-block:: python + + (_, phi_ref) = model.master.attach_model(model=model.model_pwt) + +At this moment you can print the :math:`\Phi` matrix to see its content: + +.. code-block:: python + + model.get_phi(model_name=model.model_pwt) + +The following code can be used to check whether the attaching was successful: + +.. code-block:: python + + for model_description in model.info.model: + print model_description + +The output will be similar to the following: + +.. code-block:: none + + + + name: "nwt" + + type: "class artm::core::DensePhiMatrix" + + num_topics: 50 + + num_tokens: 2500 + + + + name: "pwt" + + type: "class __artm::core::AttachedPhiMatrix__" + + num_topics: 50 + + num_tokens: 2500 + + + +You can see that the type of the :math:`\Phi` matrix has changed from ``DensePhiMatrix`` to ``AttachedPhiMatrix``. + +Now let's assume that you have created a ``pwt_new`` matrix of the same size, filled with custom values. Let's write these values into our :math:`\Phi` matrix.
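For illustration only, here is a minimal sketch of one way such a ``pwt_new`` matrix could be prepared. The use of ``numpy``, the assumption that ``phi_ref`` behaves as a numpy array with a ``.shape`` attribute, and the column-normalized random initialization are assumptions of this example, not requirements of the library:

.. code-block:: python

    import numpy as np

    # shape of the attached matrix: number of tokens x number of topics
    num_tokens, num_topics = phi_ref.shape

    # a hypothetical custom initialization: non-negative random values,
    # normalized so that each topic (column) is a probability distribution
    pwt_new = np.random.rand(num_tokens, num_topics)
    pwt_new /= pwt_new.sum(axis=0)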
+ +.. note:: + + You need to write the values by accessing the ``phi_ref`` variable; you are not allowed to assign the whole ``pwt_new`` matrix to it, as this operation will lead to errors in further work. + + .. code-block:: python + + for tok in xrange(num_tokens): + for top in xrange(num_topics): + phi_ref[tok, top] = pwt_new[tok, top] # CORRECT! + + phi_ref = pwt_new # NO! + +After that you can print the :math:`\Phi` matrix again and check the change of its values. From this moment you can continue your work. diff --git a/docs/tutorials/python_userguide/base_plsa.txt b/docs/tutorials/python_userguide/base_plsa.txt index 1954b55f9..b31a83403 100644 --- a/docs/tutorials/python_userguide/base_plsa.txt +++ b/docs/tutorials/python_userguide/base_plsa.txt @@ -1,2 +1,89 @@ 2. Base PLSA Model with Perplexity Score ========= + +Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`. + +At this moment you need to have the following objects: + +- a directory named ``my_collection_batches``, containing batches and the dictionary in the binary file ``my_dictionary.dict``; the directory should be located in the same place as your code file; +- a ``Dictionary`` variable ``my_dictionary``, containing this dictionary (gathered or loaded); +- a ``BatchVectorizer`` variable ``batch_vectorizer`` (the same one we created earlier). + +If everything is OK, let's start creating the model. First you need to read the specification of the ``ARTM`` class, which represents the model. Then you can use the following code to create the model: + +.. code-block:: python + + model = artm.ARTM(num_topics=20, dictionary=my_dictionary) + +Now you have created the model, containing a :math:`\Phi` matrix of size "number of words in your dictionary" :math:`\times` "number of topics" (20). This matrix was randomly initialized. Note that by default the random seed for initialization is fixed to achieve reproducibility: re-running the experiments gives the same results. If you want different random starting values, use the ``seed`` parameter of the ``ARTM`` class (different non-negative integer values lead to different initializations). + +From this moment we can start learning the model. But typically it is useful to enable some scores for monitoring the quality of the model. Let's use the perplexity now. + +You can deal with scores using the ``scores`` field of the ``ARTM`` class. The perplexity score can be added in the following way: + +.. code-block:: python + + model.scores.add(artm.PerplexityScore(name='my_first_perplexity_score', + dictionary=my_dictionary)) + +Note that the perplexity score has to be enabled explicitly in the described way (you can change other parameters that we didn't use here). You can read about it in :doc:`../scores_descr`. + +.. note:: + + If you try to create a second score with the same name, the ``add()`` call will be ignored. + +Now let's move to the main act, i.e. the learning of the model. We can do that in two ways: using the online algorithm or the offline one. The corresponding methods are ``fit_online()`` and ``fit_offline()``. It is assumed that you know the features of these algorithms, but let's briefly recall them: + +- **Offline algorithm**: many passes through the collection, one pass through a single document (optional), only one update of the :math:`\Phi` matrix per collection pass (at the end of the pass). You should use this algorithm when processing small collections. + +- **Online algorithm**: a single pass through the collection (optional), many passes through a single document, several updates of the :math:`\Phi` matrix during one pass through the collection. Use this one when you deal with large collections or collections with quickly changing topics. + +We will use offline learning here and in all further examples on this page (because the correct usage of the online algorithm requires more skill). + +Well, let's start training: + +.. code-block:: python + + model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10) + +This code chunk works slower than any previous one. As we have performed the first step of the learning, it will be useful to look at the perplexity. We need to use the ``score_tracker`` field of the ``ARTM`` class for this. It remembers all the values of all scores on each :math:`\Phi` matrix update. These data can be retrieved using the names of the scores. + +You can extract only the last value: + +.. code-block:: python + + print model.score_tracker['my_first_perplexity_score'].last_value + +Or you can extract the list of all values: + +.. code-block:: python + + print model.score_tracker['my_first_perplexity_score'].value
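To judge whether the perplexity has stabilized it may be convenient to plot its trajectory. A minimal sketch, assuming ``matplotlib`` is installed (it is not a BigARTM dependency):

.. code-block:: python

    import matplotlib.pyplot as plt

    # perplexity value after each collection pass
    perplexity_values = model.score_tracker['my_first_perplexity_score'].value
    plt.plot(range(1, len(perplexity_values) + 1), perplexity_values)
    plt.xlabel('Collection pass')
    plt.ylabel('Perplexity')
    plt.show()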
+ +If the perplexity has converged, you can finish the learning process. Otherwise you need to continue. As was noted above, the rule of having only one pass over a single document is optional. Both the ``fit_offline()`` and ``fit_online()`` methods support any number of document passes you want to have. To change this number you need to modify the corresponding parameter of the model: + +.. code-block:: python + + model.num_document_passes = 5 + +All following calls of the learning methods will use this change. Let's continue fitting: + +.. code-block:: python + + model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15) + +We continued learning the previous model by making 15 more collection passes with 5 document passes each. + +You can continue to work with this model in the described way. One more note: if at some point you realize that your model has degenerated and you don't want to create a new one, use the ``initialize()`` method; it will fill the :math:`\Phi` matrix with random numbers and won't change anything else (neither your tuning of the regularizers/scores, nor the history from ``score_tracker``): + +.. code-block:: python + + model.initialize(dictionary=my_dictionary) + +FYI, this method is called in the ``ARTM`` constructor if you pass it the dictionary parameter. Note that a change of the ``seed`` field will affect the call of ``initialize()``. + +Also note that you can pass the name of the dictionary instead of the dictionary object wherever a dictionary is used. + +.. code-block:: python + + model.initialize(dictionary=my_dictionary.name) diff --git a/docs/tutorials/python_userguide/different.txt b/docs/tutorials/python_userguide/different.txt new file mode 100644 index 000000000..5fa7cd8e1 --- /dev/null +++ b/docs/tutorials/python_userguide/different.txt @@ -0,0 +1,41 @@ +Different Useful Techniques +======= + +* **Dictionary filtering**: + +In this section we'll discuss the dictionary's self-filtering ability. Let's recall the structure of the dictionary saved in the textual format (see :doc:`m_artm`). There are many lines, one per unique token, and each line contains 5 values: the token (string), its class_id (string), its value (double) and two more integer parameters, called token_tf and token_df. token_tf is the absolute frequency of the token in the whole collection, and token_df is the number of documents in the collection where the token appeared at least once. These values are generated by the library while gathering the dictionary. They differ from value in the fact that you can't use them in the regularizers and scores, so you shouldn't change them. + +They are needed for the filtering of the dictionary. You likely don't need to use very rare or too frequent tokens in your model. Or you may simply want to reduce your dictionary to fit your model in memory. In both cases the solution is to use the ``Dictionary.filter()`` method. See its parameters in :doc:`../../api_references/python_interface`. Now let's filter the modality of usual tokens: + +.. code-block:: python + + dictionary.filter(min_tf=10, max_tf=2000, min_df_rate=0.01) + +.. note:: + If the parameter has the \_rate suffix, it denotes a relative value (i.e. from 0 to 1), otherwise it denotes an absolute value. + +This call has one feature: it overwrites the old dictionary with the new one. So if you don't want to lose your full dictionary, you first need to save it to disk and then filter the copy located in memory.
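For example (a sketch; the file name is arbitrary):

.. code-block:: python

    # keep the full version on disk, then filter only the in-memory copy
    dictionary.save(dictionary_path='my_full_dictionary')
    dictionary.filter(min_tf=10, max_tf=2000, min_df_rate=0.01)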
+ +* **Saving/loading model**: + +Now let's study saving the model to disk. + +It's important to understand that the model contains two matrices: :math:`\Phi` (or :math:`p_{wt}`) and :math:`n_{wt}`. To make the model loadable without losses you need to save both of these matrices. The current library version can save only one matrix per method call, so you will need two calls: + +.. code-block:: python + + model.save(filename='saved_p_wt', model_name='p_wt') + model.save(filename='saved_n_wt', model_name='n_wt') + +The model will be saved in binary format. To use it later you need to load its matrices back: + +.. code-block:: python + + model.load(filename='saved_p_wt', model_name='p_wt') + model.load(filename='saved_n_wt', model_name='n_wt') + +.. note:: + + The model after loading will only contain the :math:`\Phi` and :math:`n_{wt}` matrices and some associated information (like the number of topics, their names, the names of the modalities (without weights!) and some other data). So you need to restore all necessary scores, regularizers, modality weights and all important parameters, like ``cache_theta``. + +You can use the ``save``/``load`` pair of methods in the case of long fitting, when restoring the parameters is much easier than re-fitting the model. diff --git a/docs/tutorials/python_userguide/index.txt b/docs/tutorials/python_userguide/index.txt index abe3ee5a2..8d38f86bb 100644 --- a/docs/tutorials/python_userguide/index.txt +++ b/docs/tutorials/python_userguide/index.txt @@ -18,5 +18,7 @@ Python Guide phi_theta_extraction coherence attach_model + ptdw + different .. vim:ft=rst diff --git a/docs/tutorials/python_userguide/loading_data.txt b/docs/tutorials/python_userguide/loading_data.txt index f6486995f..994102b62 100644 --- a/docs/tutorials/python_userguide/loading_data.txt +++ b/docs/tutorials/python_userguide/loading_data.txt @@ -1,2 +1,103 @@ 1. Loading Data: BatchVectorizer and Dictionary ======= + +Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`. + +* **BatchVectorizer**: + +Before starting modeling you need to convert your data into the library format. At first you need to read about the supported source data formats in :doc:`../datasets`. It's your task to prepare your data in one of these formats. Once you have transformed your data into one of the source formats, you can convert it into the BigARTM internal format (batches of documents) using a ``BatchVectorizer`` class object. + +Actually, there is one more simple way to process your collection, if it is not too big and you don't need to store it in batches. To use it you need to obtain two variables: a ``numpy.ndarray`` ``n_wd`` with :math:`n_{wd}` counters and a corresponding Python dict with the vocabulary (key - row index in the ``numpy.ndarray``, value - corresponding token). The simplest way to get these data is to use sklearn ``CountVectorizer`` (or some similar class from sklearn).
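A possible sketch of this step is shown below; the ``texts`` list, the ``CountVectorizer`` settings and the transposition of its output into the word :math:`\times` document orientation are assumptions of this example:

.. code-block:: python

    from sklearn.feature_extraction.text import CountVectorizer

    texts = ['the first hypothetical document', 'the second hypothetical document']
    cv = CountVectorizer()
    bow = cv.fit_transform(texts)           # documents x words, sparse

    n_wd = bow.toarray().T                  # words x documents counters
    vocabulary = {index: token for token, index in cv.vocabulary_.items()}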
+ +If you have obtained the described variables, run the following code: + +.. code-block:: python + + batch_vectorizer = artm.BatchVectorizer(data_format='bow_n_wd', + n_wd=n_wd, + vocabulary=vocabulary) + +Well, if you have data in the UCI format (e.g. ``vocab.my_collection.txt`` and ``docword.my_collection.txt`` files) that was put into the same directory with your script or notebook, you can create batches using the following code: + +.. code-block:: python + + batch_vectorizer = artm.BatchVectorizer(data_path='', + data_format='bow_uci', + collection_name='my_collection', + target_folder='my_collection_batches') + +The built-in library parser converts your data into batches and wraps them with a ``BatchVectorizer`` class object, which is the general input data type for all methods of the Python API. The batches are placed in the directory you specified in the ``target_folder`` parameter. + +If you have the source file in the Vowpal Wabbit data format, you can use the following command: + +.. code-block:: python + + batch_vectorizer = artm.BatchVectorizer(data_path='', + data_format='vowpal_wabbit', + target_folder='my_collection_batches') + +The result is exactly the same as described above. + +.. note:: + + If you have created the batches once, you don't need to launch this process any more, because it takes a lot of time for a large collection. You can run the following code instead. It will create the ``BatchVectorizer`` object using the existing batches (this operation is very quick): + + .. code-block:: python + + batch_vectorizer = artm.BatchVectorizer(data_path='my_collection_batches', + data_format='batches') + +* **Dictionary**: + +The next step is to create a ``Dictionary``. This is a data structure containing the information about all unique tokens in the collection. The dictionary is generated outside of the model, and this operation can be done in different ways (load, create, gather). The most basic case is to gather the dictionary using the batches directory. You need to do this operation only once when you start working with a new collection. Use the following code: + +.. code-block:: python + + dictionary = artm.Dictionary() + dictionary.gather(data_path='my_collection_batches') + +In this case the token order in the dictionary (and in the future :math:`\Phi` matrix) will be random. If you'd like to specify some order, you need to create a vocab file (see the UCI format) containing all unique tokens of the collection in the necessary order, and run the code below (assuming your file is named ``vocab.txt`` and is located in the same directory with your code): + +.. code-block:: python + + dictionary = artm.Dictionary() + dictionary.gather(data_path='my_collection_batches', + vocab_file_path='vocab.txt') + +Take into consideration the fact that the library will ignore any token from the batches that is not present in the vocab file, if you use one. The ``Dictionary`` contains a lot of useful information about the collection. For example, each unique token in it has a corresponding variable - value. When BigARTM gathers the dictionary, it puts the relative frequency of this token into this variable. You can read about the use-cases of this variable in further sections. + +Well, now you have a dictionary. It can be saved on disk to avoid its re-creation. You can save it in the binary format: + +.. code-block:: python + + dictionary.save(dictionary_path='my_collection_batches/my_dictionary') + +Or in the textual one (if you'd like to see the gathered data, for example): + +.. code-block:: python + + dictionary.save_text(dictionary_path='my_collection_batches/my_dictionary.txt') + +A saved dictionary can be loaded back. The code for the binary file looks like this: + +.. code-block:: python + + dictionary.load(dictionary_path='my_collection_batches/my_dictionary.dict') + +For the textual dictionary you can run the following code: + +.. code-block:: python + + dictionary.load_text(dictionary_path='my_collection_batches/my_dictionary.txt') + +Besides looking at the content of the textual dictionary, you can also edit it (for example, change the content of the value field). After you load the dictionary back, these changes will be used. + +.. note:: + + All the described ways of generating batches automatically generate a dictionary. You can use it by typing: + + .. code-block:: python + + batch_vectorizer.dictionary + + If you don't want to create this dictionary, set the ``gather_dictionary`` parameter in the constructor of ``BatchVectorizer`` to False. But this flag will be ignored if ``data_format`` == ``bow_n_wd``, as it is the only possible way to generate the dictionary in this case. diff --git a/docs/tutorials/python_userguide/m_artm.txt b/docs/tutorials/python_userguide/m_artm.txt index dfe9a5e52..304ef4acc 100644 --- a/docs/tutorials/python_userguide/m_artm.txt +++ b/docs/tutorials/python_userguide/m_artm.txt @@ -1,4 +1,58 @@ 4. Multimodal Topic Models ======= +Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`. +Now let's move to more complex cases. In the last section we mentioned the term `modality`. It's something that corresponds to each token. We prefer to think about it as the type of a token. For instance, some tokens form the main text of the document, some form the title, some the names of the authors, some the tags etc. + +In BigARTM each unique token has a modality. It is denoted as ``class_id`` (don't confuse it with the classes in the classification task). You can specify the ``class_id`` of the token, or the library will set it to ``@default_class``. This class id denotes the type of usual tokens, the default type. + +In most cases you don't need to use the modalities, but there are some situations when they are indispensable, for example, in the task of document classification. Strictly speaking, we will talk about it now. + +You need to re-create all the data taking the presence of the modalities into account. Your task is to create a file in the Vowpal Wabbit format, where each line is a document, and each document contains the tokens of two modalities - the usual tokens, and the token-labels of the classes the document belongs to.
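For example, a single document in such a file could look like the following line (the document name, the tokens and the class label here are made up purely for illustration):

.. code-block:: none

    doc_17 play theatre:2 actor stage |@labels_class culture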
+ +Now follow the instructions from the introductory part again, dealing with your new Vowpal Wabbit file, to obtain the batches and the dictionary. + +The next step is to give your model the information about the modalities and their power in the model. The power of a modality is its coefficient :math:`\tau_m` (we assume you know about it).
By default the model uses only tokens of the ``@default_class`` modality, with :math:`\tau_m` = 1.0. If you need to use other modalities, you have to specify them and their weights in the constructor of the model, using the following code: + +.. code-block:: python + + model = artm.ARTM(num_topics=20, class_ids={'@default_class': 1.0, '@labels_class': 5.0}) + +Well, we asked the model to take these two modalities into consideration, and the class labels will be more powerful in this model than the tokens of the ``@default_class`` modality. Note that if you had tokens of another modality in your file, they wouldn't be taken into consideration. Similarly, if you specify a modality in the constructor that doesn't exist in the data, it will be skipped. + +Of course, the ``class_ids`` field, like all the other ones, can be reset. You can always change the weights of the modalities: + +.. code-block:: python + + model.class_ids = {'@default_class': 1.0, '@labels_class': 50.0} + # model.class_ids['@labels_class'] = 50.0 --- NO!!! + +You need to update the weights directly in this way; don't try to refer to a modality by its key: ``class_ids`` can be updated using a Python dict, but it is not a dict itself. + +The next launch of ``fit_offline()`` or ``fit_online()`` will take this new information into consideration. + +Now we need to enable scores and regularizers in the model. This process was described earlier, except for one case. All the scores of the :math:`\Phi` matrix (and perplexity) and the :math:`\Phi` regularizers have fields to deal with modalities. Through these fields you can define the modalities the score or regularizer should deal with; the other ones will be ignored (in full similarity with the ``topic_names`` field). + +The modality field can be ``class_id`` or ``class_ids``. The first one is a string containing the name of the modality to deal with, the second one is a list of such strings. + +.. note:: + + A missing value of ``class_id`` means ``class_id`` = ``@default_class``, and a missing value of ``class_ids`` means the usage of all existing modalities. + +Let's add the :math:`\Phi` sparsity score for the class labels modality and topic decorrelation regularizers for each modality, and start fitting: + +.. code-block:: python + + model.scores.add(artm.SparsityPhiScore(name='sparsity_phi_score', + class_id='@labels_class')) + + model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_def', + class_ids=['@default_class'])) + + model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_lab', + class_ids=['@labels_class'])) + + model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10) + +Well, we will leave the rest of the work to you (tuning the :math:`\tau` and :math:`\tau_m` coefficients, looking at the score results etc.). In :doc:`phi_theta_extraction` we will move to the usage of the fitted model for the classification of test data. diff --git a/docs/tutorials/python_userguide/phi_theta_extraction.txt b/docs/tutorials/python_userguide/phi_theta_extraction.txt index ea6fa7642..b4fbc906a 100644 --- a/docs/tutorials/python_userguide/phi_theta_extraction.txt +++ b/docs/tutorials/python_userguide/phi_theta_extraction.txt @@ -1,2 +1,48 @@ 5. Phi and Theta Extraction. Transform Method ======= + +Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`.
+ +* **Phi/Theta extraction** + +Let's assume that you have data and a model fitted on this data. You have tuned all the necessary regularizers and used scores. But the set of quality measures of the library wasn't enough for you, and you need to compute your own scores using the :math:`\Phi` and :math:`\Theta` matrices. In this case you are able to extract these matrices using the following code: + +.. code-block:: python + + phi = model.get_phi() + theta = model.get_theta() + +Note that you need the ``cache_theta`` flag to be set to True if you are planning to extract :math:`\Theta` in the future without using ``transform()``. You can also extract not the whole matrices, but the parts of them that correspond to given topics (using the same ``topic_names`` parameter of the methods as in the previous sections). Also you can extract only the necessary modalities of the :math:`\Phi` matrix, if you want. + +Both methods return a ``pandas.DataFrame``. + +* **Transform new documents** + +Now we will move to the usage of the fitted model for the classification of test data (to do it you need to have a fitted multimodal model with the ``@labels_class`` modality, see :doc:`m_artm`). + +In the classification task you have the train data (the collection you used to train your model, where the model knew the true class labels of each document) and the test data. For the test data the true labels are known to you, but unknown to the model. The model needs to predict these labels using the test documents, and your task is to assess the quality of the predictions by computing some metric, AUC, for instance. + +Computing the AUC or any other quality measure is your task, we won't do it. Instead, we will learn how to get the :math:`p(c|d)` vector for each document, where each value is the probability of class :math:`c` in the given document :math:`d`. + +Well, we have a model. We assume you have put the test documents into a separate file in the Vowpal Wabbit format and created batches using it, which are wrapped by the variable ``batch_vectorizer_test``. We also assume you have saved your test batches into a separate directory (not into the one containing the train batches). + +Your test documents shouldn't contain information about the true labels (i.e. the Vowpal Wabbit file shouldn't contain the '|@labels_class' string); also the test documents shouldn't contain tokens that don't appear in the train set - such tokens will be ignored. + +If all these conditions are met, we can use the ``ARTM.transform()`` method, which allows you to get the :math:`p(t|d)` (i.e. :math:`\Theta`) or :math:`p(c|d)` matrix for all documents from your ``BatchVectorizer`` object. + +Run this code to get :math:`\Theta`: + +.. code-block:: python + + theta_test = model.transform(batch_vectorizer=batch_vectorizer_test) + +And this one to obtain :math:`p(c|d)`: + +.. code-block:: python + + p_cd_test = model.transform(batch_vectorizer=batch_vectorizer_test, + predict_class_id='@labels_class') + +In this way you get the predictions of the model in a ``pandas.DataFrame``. Now you can assess the quality of the predictions of your model in any way you need.
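For instance, a minimal sketch of turning these probabilities into hard label predictions, assuming the returned ``pandas.DataFrame`` has class labels as rows and documents as columns (the same orientation as :math:`\Theta`; transpose it first if your version differs):

.. code-block:: python

    # the most probable class label for each test document
    predicted_labels = p_cd_test.idxmax(axis=0)
    print predicted_labels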
+ +The method allows you to extract either a dense or a sparse matrix. You can also use it for the :math:`p_{tdw}` matrix (see :doc:`ptdw`). diff --git a/docs/tutorials/python_userguide/ptdw.txt b/docs/tutorials/python_userguide/ptdw.txt new file mode 100644 index 000000000..ae0096e1a --- /dev/null +++ b/docs/tutorials/python_userguide/ptdw.txt @@ -0,0 +1,2 @@ +8. Deal with Ptdw Matrix +======= diff --git a/docs/tutorials/python_userguide/regularizers_and_scores.txt b/docs/tutorials/python_userguide/regularizers_and_scores.txt index f3ba52711..75a2eebfb 100644 --- a/docs/tutorials/python_userguide/regularizers_and_scores.txt +++ b/docs/tutorials/python_userguide/regularizers_and_scores.txt @@ -1,2 +1,134 @@ 3. Regularizers and Scores Usage ========= + +Detailed description of all parameters and methods of BigARTM Python API classes can be found in :doc:`../../api_references/python_interface`. A description of the regularizers can be found in :doc:`../regularizers_descr`. + +The library has a pre-defined set of regularizers (you can create new ones if necessary; you can read about it in the corresponding notes in :doc:`../../devguide/create_regularizer`). Now we'll learn how to use them. + +We assume that all the conditions from the head of the section :doc:`base_plsa` are met. Let's create the model and enable the perplexity score in it: + +.. code-block:: python + + model = artm.ARTM(num_topics=20, dictionary=my_dictionary, cache_theta=False) + model.scores.add(artm.PerplexityScore(name='perplexity_score', + dictionary=my_dictionary)) + +Note the ``cache_theta`` flag: it defines whether your :math:`\Theta` matrix is kept in memory or not. If you have a large collection, it can be impossible to store its :math:`\Theta` in memory, while for a small collection it can be useful to look at it. The default value is True. In the cases when you need to use the :math:`\Theta` matrix, but it is too big, you can use the ``ARTM.transform()`` method (it will be discussed later). + +Now let's try to add other scores, because the perplexity is not the only one to use. + +Let's add the scores of sparsity of the :math:`\Phi` and :math:`\Theta` matrices and the information about the most probable tokens in each topic (top-tokens): + +.. code-block:: python + + model.scores.add(artm.SparsityPhiScore(name='sparsity_phi_score')) + model.scores.add(artm.SparsityThetaScore(name='sparsity_theta_score')) + model.scores.add(artm.TopTokensScore(name='top_tokens_score')) + +Scores have many useful parameters. For instance, they can be computed on subsets of topics. Let's separately count the sparsity of the first ten topics in :math:`\Phi`. But there's a problem: topics are identified by their names, and we didn't specify them. If we had used the ``topic_names`` parameter in the constructor (instead of the ``num_topics`` one), we wouldn't have such a problem. But the solution is very easy: BigARTM has generated the names and put them into the ``topic_names`` field, so you can use it: + +.. code-block:: python + + model.scores.add(artm.SparsityPhiScore(name='sparsity_phi_score_10_topics', topic_names=model.topic_names[0: 10])) + +Certainly, we could have modified the previous score without creating a new one, if the overall model sparsity wasn't interesting for us: + +.. code-block:: python + + model.scores['sparsity_phi_score'].topic_names = model.topic_names[0: 10] + +But let's assume that we are also interested in it and keep everything as is. You should remember that all the parameters of scores, the model and regularizers (we will talk about them soon) can be set and reset by directly changing the corresponding field, as was demonstrated in the code above. + +For example, let's ask the top-tokens score to show us the 12 most probable tokens in each topic: + +..
code-block:: python + + model.scores['top_tokens_score'].num_tokens = 12 + +Well, we have got a model covered with the necessary scores, and we can start the fitting process: + +.. code-block:: python + + model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10) + +We saw this code in the first section. But now we can also look at the values of the newly added scores: + +.. code-block:: python + + print model.score_tracker['perplexity_score'].value # .last_value + print model.score_tracker['sparsity_phi_score'].value # .last_value + print model.score_tracker['sparsity_theta_score'].value # .last_value + +As you can see, nothing has changed here. But we forgot about the top-tokens. Here we need to act more carefully: the score stores the data from each moment of the :math:`\Phi` update. Let's assume that we need only the last data. So we need to use the ``last_tokens`` field. It is a Python dict, where a key is a topic name, and the value is a list of top-tokens of this topic. + +.. note:: + + The scores are loaded from the core on each call, so for big scores, such as top-tokens (or the topic kernel score), it's strongly recommended to store the whole score in a local variable and then deal with it. So, let's look through all the top-tokens in a loop: + + .. code-block:: python + + saved_top_tokens = model.score_tracker['top_tokens_score'].last_tokens + + for topic_name in model.topic_names: + print saved_top_tokens[topic_name] + +The topics are probably not very good. To increase their quality you can use the regularizers. The code for dealing with the regularizers is very similar to the one for scores. Let's add three regularizers into our model: sparsing of the :math:`\Phi` matrix, sparsing of the :math:`\Theta` matrix and topic decorrelation. The last one is needed to make the topics more different. + +.. code-block:: python + + model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi_regularizer')) + model.regularizers.add(artm.SmoothSparseThetaRegularizer(name='sparse_theta_regularizer')) + model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_regularizer')) + +Maybe you have a question about the name of the ``SmoothSparsePhi/Theta`` regularizers. Yes, they can both smooth and sparse topics. Their action depends on the value of the corresponding regularization coefficient ``tau`` (we assume that you know what it is). ``tau`` > 0 leads to smoothing, ``tau`` < 0 to sparsing. By default all the regularizers have ``tau`` = 1, which is usually not what you want. Choosing a good ``tau`` is a heuristic; sometimes you need to run dozens of experiments to pick good values. It is experimental work, and we won't discuss it here. Let's look at the technical details instead: + +.. code-block:: python + + model.regularizers['sparse_phi_regularizer'].tau = -1.0 + model.regularizers['sparse_theta_regularizer'].tau = -0.5 + model.regularizers['decorrelator_phi_regularizer'].tau = 1e+5 + +We set typical values, but in an unlucky case they can be useless or even harmful for the model. + +We draw your attention again to the fact that setting and changing the values of the regularizer parameters is fully similar to the scores. + +Let's start the learning process: + +.. code-block:: python + + model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10) + +Further you can look at the scores, change the ``tau`` coefficients of the regularizers etc. As with scores, you can ask a regularizer to deal only with given topics, using the ``topic_names`` parameter.
+ +Let's return to the dictionaries, starting with one more discussion. Let's look at the principle of work of the ``SmoothSparsePhi`` regularizer. It simply adds the same value ``tau`` to all counters. Such a strategy can be unsuitable for us. A probable case: the need to sparse one part of the tokens, smooth another part and ignore the rest. For example, let's sparse the tokens about `magic`, smooth the tokens about `cats` and ignore all the other ones. + +In this situation we need dictionaries. + +Let's remember the value field that corresponds to each unique token, and also the fact that the ``SmoothSparsePhi`` regularizer has a ``dictionary`` field. If you set this field, the regularizer will add ``tau`` * ``value`` to the counters of each token, instead of ``tau``. In this way we can set ``tau`` to 1, for instance, and set the ``value`` variable in the dictionary to -1.0 for the tokens about `magic`, to 1.0 for the tokens about `cats`, and to 0.0 for all the other tokens. And we'll get what we need. + +The last problem is how to change these ``value`` variables. It was discussed in :doc:`loading_data`: let's recall the ``Dictionary.save_text()`` and ``Dictionary.load_text()`` methods. + +You need to proceed with the following steps: + +- save the dictionary in the textual format; +- open it: each line corresponds to one unique token and contains 5 values: ``token`` - ``modality`` - ``value`` - ``token_tf`` - ``token_df``; +- don't pay attention to anything except the token and the value; find all the tokens you are interested in and change their value parameters; +- load the dictionary back into the library. + +Your file can look like this after editing (conceptually): + +.. code-block:: none + + cat smth 1.0 smth smth + shower smth 0.0 smth smth + magic smth -1.0 smth smth + kitten smth 1.0 smth smth + merlin smth -1.0 smth smth + moscow smth 0.0 smth smth + +All the code you need to perform the discussed operation was shown above. Here is an example of the creation of the regularizer with a dictionary: + +.. code-block:: python + + model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='smooth_sparse_phi_regularizer', + dictionary=my_dictionary))
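The editing of the ``value`` fields can also be scripted. Below is a rough sketch under several assumptions: the token lists, the file names, the whitespace-separated five-column layout from the conceptual example above, and the handling of extra service lines are all made up for illustration (keep any non-token lines of a real file untouched):

.. code-block:: python

    sparse_tokens = set(['magic', 'merlin'])   # tokens to sparse (assumption)
    smooth_tokens = set(['cat', 'kitten'])     # tokens to smooth (assumption)

    my_dictionary.save_text(dictionary_path='my_dictionary.txt')

    edited_lines = []
    with open('my_dictionary.txt', 'r') as fin:
        for line in fin:
            fields = line.split()
            if len(fields) == 5:  # token, class_id, value, token_tf, token_df
                if fields[0] in sparse_tokens:
                    fields[2] = '-1.0'
                elif fields[0] in smooth_tokens:
                    fields[2] = '1.0'
                else:
                    fields[2] = '0.0'
                edited_lines.append(' '.join(fields) + '\n')
            else:
                edited_lines.append(line)  # keep service lines as they are

    with open('my_dictionary_edited.txt', 'w') as fout:
        fout.writelines(edited_lines)

    my_dictionary.load_text(dictionary_path='my_dictionary_edited.txt')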
diff --git a/docs/tutorials/scores_descr.txt b/docs/tutorials/scores_descr.txt index 9056d5d8b..ee39e4b4d 100644 --- a/docs/tutorials/scores_descr.txt +++ b/docs/tutorials/scores_descr.txt @@ -32,7 +32,7 @@ Computes the ratio of elements of :math:`\Phi` matrix (or it's part) that are le * **Usage:** -One of the goals of regularization is to archive a sparse structure of :math:`\Phi` matrix using different sparsing regularizers. This scores allows to control this process. While using different regularization stratgies in different parts of model you can create a score per each part and one for whole model to have detailed and whole values. +One of the goals of regularization is to achieve a sparse structure of :math:`\Phi` matrix using different sparsing regularizers. This score allows you to control this process. While using different regularization strategies in different parts of the model you can create a score per each part and one for the whole model to get both detailed and overall values. Sparsity Theta @@ -44,7 +44,7 @@ Computes the ratio of elements of :math:`\Theta` matrix (or it's part) that are * **Usage:** -One of the goals of regularization is to archive a sparse structure of :math:`\Theta` matrix using different sparsing regularizers. This scores allows to control this process. While using different regularization stratgies in different parts of :math:`\Theta` you can create a score per each part and one for whole matrix to have detailed and whole values. +One of the goals of regularization is to achieve a sparse structure of :math:`\Theta` matrix using different sparsing regularizers. This score allows you to control this process. While using different regularization strategies in different parts of :math:`\Theta` you can create a score per each part and one for the whole matrix to get both detailed and overall values. Top Tokens @@ -71,7 +71,7 @@ Topic Kernel Scores * **Description:** -This score was created as one more way to control the inpretability of the topics. Let's define the `topic kernel` as :math:`W_t = big\{ w \in W | p(t|w) > \mathrm{threashold} big\}`, and determine several measures, based on it: +This score was created as one more way to control the interpretability of the topics. Let's define the `topic kernel` as :math:`W_t = \big\{ w \in W | p(t|w) > \mathrm{threshold} \big\}`, and determine several measures based on it: - :math:`\sum_{w \in W_t} p(w|t)` - the `purity` of the topic (higher values corresponds better topics); - :math:`\cfrac{1}{|W_t|}\sum_{w \in W_t} p(t|w)` - `contrast` of the topic (higher values corresponds better topics); @@ -91,7 +91,7 @@ Topic Mass * **Description:** -Computes the :math:`n_t` values for each requested topic in :math:`Phi` matrix. +Computes the :math:`n_t` values for each requested topic in :math:`\Phi` matrix. * **Usage:**