# Mallet

Another algorithm for topic modeling is implemented in the Java-based software Mallet. To follow this section, **you need to download and install Mallet** from http://mallet.cs.umass.edu/download.php.
Mallet takes plain text as input, so none of the preprocessing steps of this package are available for Mallet topic modeling yet.

#### Loading modules from DARIAH-Topics library
First, we have to get access to the functionalities of the library by importing them. To call its functions, we use the prefixes assigned to the toolbox's submodules in the import statements (`pre`, `visual` and `mal`).

In [1]:
from dariah_topics import preprocessing as pre
from dariah_topics import visualization as visual
from dariah_topics import mallet as mal

#### Activating inline output in Jupyter notebook
The following line will just tell the notebook to show graphics in the output frames.

In [2]:
%matplotlib inline


## 1. Setting the parameters

#### Define path to corpus folder

In [3]:
path_to_corpus = "corpus_txt"

#### Path to Mallet binary

Now we must tell the library where to find the local installation of Mallet. If Mallet is installed so that it is available on your system path, it is sufficient to set `path_to_mallet = "mallet"`. If you only keep Mallet in a local folder, you have to specify the path to the binary explicitly.

In [4]:
path_to_mallet = "/home/steffen/Software/mallet/bin/mallet"
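
Before going on, it can help to check that the configured value actually resolves to an executable. The following is a minimal sketch using only the Python standard library; the helper name `resolve_mallet` is hypothetical and not part of DARIAH-Topics.

```python
import os
import shutil


def resolve_mallet(path_to_mallet):
    """Return the absolute path of the Mallet executable, or None if not found.

    Hypothetical helper for illustration; not part of DARIAH-Topics.
    """
    # shutil.which searches the system path for bare names such as "mallet"
    # and also accepts explicit paths; it returns None if nothing matches.
    found = shutil.which(path_to_mallet)
    return os.path.abspath(found) if found else None


resolved = resolve_mallet("mallet")
if resolved is None:
    print("Mallet not found; set path_to_mallet to the full path of bin/mallet")
else:
    print("Using Mallet at", resolved)
```

If the check fails for the bare name `"mallet"`, passing the full path to `bin/mallet` inside your Mallet folder should work instead.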

#### Output folder

In [5]:
outfolder = "tutorial_supplementals/mallet_output"

#### Stopword list

In the current workflow, Mallet can filter out the words of a given stop word list during its internal preprocessing.

In [6]:
stoplist = "tutorial_supplementals/stoplist/en.txt"
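
To make clear what stop word filtering does, here is a small sketch in plain Python. It assumes a stop list with one word per line (an assumption about the file layout) and uses the hypothetical helpers `load_stopwords` and `remove_stopwords`; Mallet performs the equivalent step internally, so this code is not needed for the workflow itself.

```python
def load_stopwords(lines):
    """Build a set of lowercased stop words from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only tokens that are not in the stop word set (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopwords]


# Toy data standing in for the contents of a stop list file and a document.
stopwords = load_stopwords(["the", "of", "and", ""])
tokens = "The history of the decline and fall".split()
print(remove_stopwords(tokens, stopwords))  # ['history', 'decline', 'fall']
```

With a real file, an open file handle can be passed directly, since iterating over it yields lines: `load_stopwords(open(stoplist, encoding="utf-8"))`.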

#### The Mallet corpus model

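Finally, we can pass all these paths to the function `mal.create_mallet_model`, which handles all the preprocessing steps and creates a Mallet-specific corpus model object. The first call below keeps all words; the second one additionally removes the words listed in `stoplist`. Run only one of the two cells.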

In [None]:
mallet_model = mal.create_mallet_model(path_to_mallet = path_to_mallet, 
                                       outfolder = outfolder,
                                       path_to_corpus = path_to_corpus
                                      )

In [None]:
mallet_model = mal.create_mallet_model(path_to_mallet = path_to_mallet, 
                                       outfolder = outfolder,
                                       path_to_corpus = path_to_corpus,
                                       remove_stopwords = "True", 
                                       stoplist = stoplist
                                      )

## 2. Model creation

**Warning: this step can take quite a while!** Depending on corpus size and the number of iterations, it can take anywhere from a few seconds to several hours.

In [None]:
output_folder = mal.create_mallet_output(path_to_mallet = path_to_mallet, 
                                         path_to_malletModel = mallet_model, 
                                         outfolder = outfolder
                                        )
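
For orientation, the two wrapper calls above presumably correspond to Mallet's `import-dir` and `train-topics` commands. The sketch below only builds the argument lists and does not run anything; it is an assumption about what happens behind the scenes, not the library's actual implementation. The helper names, the model file name `corpus.mallet` and the default of 10 topics are hypothetical.

```python
import os


def build_import_command(path_to_mallet, path_to_corpus, model_file, stoplist=None):
    """Assemble a `mallet import-dir` call as an argument list."""
    command = [path_to_mallet, "import-dir",
               "--input", path_to_corpus,
               "--output", model_file,
               "--keep-sequence"]  # topic modeling needs the token sequence
    if stoplist is not None:
        # Remove stop words, reading them from a custom file.
        command += ["--remove-stopwords", "--stoplist-file", stoplist]
    return command


def build_train_command(path_to_mallet, model_file, outfolder, num_topics=10):
    """Assemble a `mallet train-topics` call as an argument list."""
    return [path_to_mallet, "train-topics",
            "--input", model_file,
            "--num-topics", str(num_topics),
            "--output-doc-topics", os.path.join(outfolder, "doc_topics.txt"),
            "--output-topic-keys", os.path.join(outfolder, "topic_keys.txt")]


model_file = os.path.join("mallet_output", "corpus.mallet")  # hypothetical name
import_cmd = build_import_command("mallet", "corpus_txt", model_file, "en.txt")
train_cmd = build_train_command("mallet", model_file, "mallet_output")
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

Each list could be executed with `subprocess.run(command, check=True)` if you want to call Mallet directly instead of going through the wrapper.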

### 2.1. Create document-topic matrix

The generated model output can now be translated into a human-readable document-topic matrix (actually a pandas data frame), which is our principal exchange format for topic modeling results. To generate the matrix from the Mallet output, we can use the following function:

In [None]:
doc_topic = mal.show_docTopicMatrix(output_folder, "doc_topics.txt")
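
Because the result is an ordinary pandas data frame, the usual pandas operations apply. The toy matrix below is made up for illustration and assumes topics as rows and documents as columns; check `doc_topic.shape` and the index of your own result before relying on that orientation.

```python
import pandas as pd

# Made-up proportions: each column (document) sums to 1.
toy = pd.DataFrame(
    {"doc_a": [0.7, 0.2, 0.1], "doc_b": [0.1, 0.3, 0.6]},
    index=["topic_0", "topic_1", "topic_2"],
)

# Most prominent topic per document (row label of the column-wise maximum).
dominant = toy.idxmax(axis=0)
print(dominant)

# Documents ranked by the share of a single topic.
ranked = toy.loc["topic_2"].sort_values(ascending=False)
print(ranked)
```

If your matrix has documents as rows instead, transposing it first with `.transpose()` gives the layout assumed here.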

## 3. Visualization

Now we can see the topics in the model with the following function:

In [None]:
mal.show_topics_keys(output_folder, topicsKeyFile = "topic_keys.txt")
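
If you want to process the topic keys yourself, the file can also be parsed by hand. The sketch below assumes the layout Mallet writes with `--output-topic-keys`: one topic per line, with the topic number, a weight and the space-separated key words in tab-separated columns. The sample string and the helper name `parse_topic_keys` are made up for illustration.

```python
def parse_topic_keys(lines):
    """Map each topic number to its list of key words."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        # Columns: topic number, weight, space-separated key words.
        topic_id, _weight, words = line.split("\t", 2)
        topics[int(topic_id)] = words.split()
    return topics


# Made-up sample in the assumed file layout.
sample = "0\t0.5\twhale sea ship captain\n1\t0.5\tlove heart letter\n"
parsed = parse_topic_keys(sample.splitlines())
for topic_id, words in parsed.items():
    print(topic_id, ", ".join(words))
```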

### 3.1. Distribution of topics

#### Distribution of topics over all documents

The distribution of topics over all documents can now be visualized in a heat map:

In [None]:
heatmap = visual.doc_topic_heatmap(doc_topic.transpose())
heatmap.show()

#### Distribution of topics in a single document

To take a closer look at the topics in a single text, we can use the following function, which shows all the topics in a text and their respective proportions. To select the document, we pass its index to the function.

In [None]:
visual.plot_doc_topics(doc_topic.transpose(), 6)