# Using MALLET for Topic Modelling: An Introduction

MALLET (Machine Learning for Language Toolkit) is a Java-based toolkit for natural language processing, including document classification, clustering and topic modelling. It is a pure command-line tool: it does not offer a graphical user interface (GUI) and has to be run from your terminal.

In this tutorial, you will learn when and how to use MALLET for topic modelling.

## Why use MALLET for topic modelling?

Imagine that you want to find out which topics are covered in a large corpus of texts. Reading all of it to extract the most relevant topics would take ages. It would also confront you with a question that is difficult to answer before you have at least skimmed the material: how many topics should you extract?
If you are working with a small number of files (or even a single document), you could of course perform a word-frequency analysis and try to gauge the most important topics simply by looking at the results. But this approach becomes less feasible for large corpora. This is where MALLET comes in: it groups words that tend to occur together into topics and assigns each word in a text to its most likely topic.
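
For a single document, the word-frequency approach mentioned above can be sketched with standard Unix tools. This is only an illustration: the helper name `word_freq` is our own and not part of MALLET, and the simple word splitting only recognises unaccented letters in most setups.

```shell
# Print the N most frequent words of a plain-text file (hypothetical helper).
word_freq() {
  file="$1"
  top="${2:-10}"                        # default: show the 10 most frequent words
  tr -cs '[:alpha:]' '\n' < "$file" |   # put every word on its own line
    tr '[:upper:]' '[:lower:]' |        # ignore upper/lower case
    grep -v '^$' |                      # drop empty lines
    sort | uniq -c |                    # count identical words
    sort -k1,1nr -k2,2 |                # most frequent first, ties alphabetical
    head -n "$top"
}

# Example usage (replace the path with any .txt file of your own):
# word_freq my_text.txt 5
```

Without a stop word list, the top of such a list is usually dominated by function words such as "the" and "and", which is one reason why this approach quickly reaches its limits.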

Let's go through this step-by-step and see how it works!

**Step 1: Installing MALLET**

Since MALLET is a command-line tool, we have to open a terminal in our codespace to install and run the application. In your VS Code editor, open a terminal and type `cd panel-8` to navigate to the `panel-8` folder, then type `ls` to display its contents. You will see a shell script called `setup_mallet.sh`, which contains all the commands needed to install MALLET in your codespace. Let's execute it: in your terminal, type `./setup_mallet.sh`. Congrats - you have now installed MALLET :)
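
If you would like to confirm that the installation worked, a quick check along the following lines can help. This is only a sketch: the helper name `check_mallet` is our own, and it assumes that the setup script unpacked MALLET into a `mallet-2.0.8` folder inside `panel-8`, the location used later in this tutorial.

```shell
# Report whether a MALLET folder contains an executable launcher
# (hypothetical helper, not part of MALLET).
check_mallet() {
  dir="$1"
  if [ -x "$dir/bin/mallet" ]; then
    echo "MALLET launcher found in $dir"
  else
    echo "No executable bin/mallet in $dir"
    return 1
  fi
}

# MALLET is a Java program, so a Java runtime must be available as well.
if command -v java >/dev/null 2>&1; then
  echo "Java is available"
else
  echo "Java was not found on the PATH"
fi

# Example usage from inside the panel-8 folder:
# check_mallet mallet-2.0.8
```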

**Step 2: Choose a text to analyse**

For MALLET to be able to process your texts, they should be in `.txt` format. MALLET comes with a wide range of test files which are already in the right format and which you can use in this tutorial. Just have a look: In your terminal or your file explorer (on the left), navigate to the location `mallet-2.0.8/sample-data`:

![Accessing MALLET's sample data](data/mallet_sample_data.png)

Alternatively, you can of course also upload your own texts for analysis. If they are not in `.txt` format, you can open them in a text editor or word processor of your choice (e.g. Notepad or Word) and save them as plain text with the `.txt` extension.

For the tutorial, we will use the files in `mallet-2.0.8/sample-data/web/en`.
Let's look at the files with the help of the explorer! As you can see, there are several `.txt` files and all of them consist of a single line of text:

![Accessing sample texts](data/mallet_sample_text.png)

**Step 3: Navigate to the mallet folder**

For the rest of the tutorial, we will be executing all our commands in the terminal from the location `/workspaces/WORCK-DH-Winter-School-2024/panel-8/mallet-2.0.8`. The easiest way for you to get there in your codespace is to use the file explorer on the left: Open the folder `panel-8` and right click on `mallet-2.0.8`, then in the window that opens choose `Open in Integrated Terminal`:

![Open terminal in right location](data/mallet_open_terminal.png)


You can of course also execute the following commands from a different location. Just be aware that you would then have to adjust the paths.

**Step 4: Convert your text to a format readable by MALLET**

Now that we have our data, we have to pass it to MALLET as input for topic modelling. Before it can perform topic modelling, MALLET needs to transform your texts into yet another format, a `.mallet` file. This file format is designed to represent the input data in a way that MALLET can process efficiently.

Let's tell MALLET to output the right data format!

Right click on the `panel-8` folder in your explorer and choose `New Folder`:

![Create output folder](data/mallet_create_output_folder.png)

Call the new folder `output`.

Then, in your terminal, type:

`bin/mallet import-dir --input sample-data/web/en --output ../output/corpus.mallet --keep-sequence --remove-stopwords`

Let's break down this code step-by-step:

* `bin/mallet`: This is the path to the MALLET executable. The `bin` directory contains executable files, and `mallet` is the launcher for all MALLET commands.

* `import-dir`: Imports a batch of files from a directory. Each file is treated as one document.

* `--input sample-data/web/en`: Specifies the input directory containing the `.txt` files to import.

* `--output ../output/corpus.mallet`: Specifies the output file where MALLET will store the processed data.

* `--keep-sequence`: This option instructs MALLET to store each document as a sequence of words in their original order rather than as a list of word counts. MALLET's topic modelling requires data imported with this option.

* `--remove-stopwords`: This option instructs MALLET to remove common stop words from the text data, using its default English stop word list.
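
The folder creation and the import can also be done entirely from the terminal. The following sketch assumes that you run it from inside the `mallet-2.0.8` folder (see Step 3) and uses the same paths as the command above. The final check merely confirms that a non-empty file was written; it says nothing about the quality of the import.

```shell
# Paths as used in this tutorial, relative to the mallet-2.0.8 folder.
input_dir="sample-data/web/en"
output_file="../output/corpus.mallet"

if [ -x bin/mallet ] && [ -d "$input_dir" ]; then
  mkdir -p "$(dirname "$output_file")"   # create ../output if it is missing
  echo "Importing $(ls "$input_dir" | wc -l) files"
  bin/mallet import-dir --input "$input_dir" --output "$output_file" \
    --keep-sequence --remove-stopwords
  # -s is true only if the file exists and is not empty
  [ -s "$output_file" ] && echo "Created $output_file"
else
  echo "Nothing imported: run this from inside the mallet-2.0.8 folder"
fi
```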


Now look into your `output` folder. Inside it, you should see the `.mallet` file you just created:

![.mallet file](data/mallet_mallet_format.png)

The file is stored in a binary format, so your editor cannot display its content, but MALLET can now process it for us.

**Step 5: Use MALLET to extract topics**

We're now going to tell MALLET to perform topic modelling. In your terminal, type:

`bin/mallet train-topics --input ../output/corpus.mallet`

Let's break down the code step-by-step:

* `bin/mallet`: This is the path to the MALLET executable.

* `train-topics`: This is the MALLET command for training topic models. It takes a MALLET-formatted input file, such as the one generated in the previous step (`../output/corpus.mallet`), and trains a topic model on that data.

* `--input ../output/corpus.mallet`: Specifies the input file for the topic model training process.

MALLET now iterates over your corpus many times, refining the assignment of words to topics with each pass. While it is doing this, your terminal window fills with progress output. At regular intervals, MALLET prints the current key words for each topic, i.e. the words most strongly associated with that topic at that point in training. Because the procedure involves randomness, the topics can differ slightly from one run to the next. When training is done, you can scroll up to inspect the output:

![Mallet train topics](data/mallet_train_topics.png)

By default, MALLET extracts 10 topics. If you want a different number, you can set the `--num-topics` flag (e.g. to 7 topics) like this:

`bin/mallet train-topics --input ../output/corpus.mallet --num-topics 7`

Try it out!

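Another option you should include is `--optimize-interval`. It turns on hyperparameter optimisation, which MALLET then repeats every given number of iterations. This allows some topics to become more prominent than others, so that the weight reported for each topic becomes informative. The command would then look like this: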

`bin/mallet train-topics --input ../output/corpus.mallet --num-topics 7 --optimize-interval 7`

Try that out too!
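
Because the right number of topics is rarely known in advance, it can help to train several models and compare them. The loop below is a sketch under the assumption that you run it from inside `mallet-2.0.8` after Step 4. It relies on the `--output-topic-keys` option, which is explained in Step 6, to write the key words of each run to its own file. The topic counts and file names are made up for this example.

```shell
# Train one model per topic count and keep the key words of each run.
for k in 5 7 10; do
  keys_file="../output/keys_${k}_topics.txt"   # hypothetical file name
  if [ -x bin/mallet ] && [ -f ../output/corpus.mallet ]; then
    bin/mallet train-topics --input ../output/corpus.mallet \
      --num-topics "$k" --optimize-interval 10 \
      --output-topic-keys "$keys_file"
  else
    echo "Skipping $k topics: run this from mallet-2.0.8 after Step 4"
  fi
done
```

Comparing the resulting files side by side gives a first impression of whether more topics produce finer distinctions or merely split coherent topics apart.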

**Step 6: Save the output in a file**

You can already see your results in the terminal, but wouldn't it be convenient to be able to save your results in a file so you are able to take a look at them later and share them? We can tell MALLET to do exactly that.

In your terminal, type:

`bin/mallet train-topics --input ../output/corpus.mallet --output-state ../output/topic-state.gz --output-topic-keys ../output/corpus_keys.txt --output-doc-topics ../output/corpus_composition.txt --optimize-interval 10`

Let's go through step-by-step:

* `bin/mallet train-topics`: Executes the `train-topics` command from the MALLET toolkit.

* `--input ../output/corpus.mallet`: Specifies the input file containing the preprocessed corpus data (assuming it's in the `output` directory).

* `--output-state ../output/topic-state.gz`: Specifies the location and filename for the output state file, a compressed file listing every word in the corpus together with the topic it was assigned to.

* `--output-topic-keys ../output/corpus_keys.txt`: Specifies the location and filename for the output file containing the topic keys, i.e. the top words of each topic.

* `--output-doc-topics ../output/corpus_composition.txt`: Specifies the location and filename for the output file containing the document-topic distributions, i.e. the share of each topic in each document.

* `--optimize-interval 10`: Turns on hyperparameter optimisation every 10 iterations, as introduced in Step 5.

The files should now appear in your `output` folder next to `corpus.mallet`:

![Mallet output](data/mallet_output.png)
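
The composition file quickly becomes hard to read by eye, so a small script can summarise it. The sketch below assumes the tab-separated layout we would expect from MALLET 2.0.8 with default settings: a document index, the document name, and then one proportion per topic in topic order. Other versions or settings may write a different layout, so please check your own file first. The helper name `dominant_topics` is our own.

```shell
# Print "<file name> <dominant topic> <proportion>" for every document.
# Assumed line layout: index <TAB> name <TAB> one proportion per topic.
dominant_topics() {
  awk -F'\t' '
    /^#/   { next }                     # skip a header line, if present
    NF < 3 { next }                     # skip empty or incomplete lines
    {
      best = 3                          # column of the largest proportion so far
      for (i = 4; i <= NF; i++)
        if (($i + 0) > ($best + 0)) best = i
      n = split($2, parts, "/")         # keep only the file name
      print parts[n], best - 3, $best   # topics are counted from 0
    }' "$1"
}

# Example usage from inside mallet-2.0.8:
# dominant_topics ../output/corpus_composition.txt
```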

**Step 7: Look at the output**

Now we can use the file explorer to navigate to our `output` folder and open `corpus_keys.txt` to look at the result:

![Corpus Keys Output](data/mallet_corpus_keys_output.png)

After opening the file, you will see multiple paragraphs, one per topic. The number at the beginning of each paragraph is the topic index. We begin counting at 0: the first paragraph is topic 0, the second paragraph is topic 1, and so on. The second number in each paragraph is the Dirichlet parameter for the topic. It gives you an idea of the weight of a given topic. The words you see in each paragraph are the words which MALLET found to best describe a given topic.
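
To see the most prominent topics first, the keys file can be sorted by its second column. This is a minimal sketch that assumes the three parts of each line (topic index, weight, words) are separated by tab characters; the helper name `topics_by_weight` is our own.

```shell
# List topics from the largest to the smallest weight.
# Assumed line layout: topic index <TAB> weight <TAB> top words.
topics_by_weight() {
  tab="$(printf '\t')"
  # LC_ALL=C makes sure that "." is read as the decimal separator
  LC_ALL=C sort -t "$tab" -k2,2nr "$1"
}

# Example usage from inside mallet-2.0.8:
# topics_by_weight ../output/corpus_keys.txt
```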

**Step 8: Fine-tune your output**

MALLET offers a wide range of parameters you can set to improve the quality of your output according to your needs. So far, we have only used some of them, like `--keep-sequence` and `--remove-stopwords`. If you want to find out which other parameters you can adjust, you can type:

`bin/mallet import-dir --help`

![Import-Dir --help](data/mallet_import_dir_help.png)

This command will provide a list of the parameters you can set when you import a batch of files (as we did in Step 4). In the same way, `bin/mallet train-topics --help` lists the parameters for the training step. If you want to pass MALLET your own list of stop words, for example one that is appropriate for a certain time period, this option might be interesting to you:

`--stoplist-file FILE
  Instead of the default list, read stop words from a file, one per line. Implies --remove-stopwords
  Default is null`
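
As a sketch of how this could look in practice, the following snippet writes a small stop word file with one word per line and passes it to `import-dir`. The helper name `make_stoplist`, the example words and the file names are all made up for illustration. Note that, according to the help text above, such a file replaces the default list rather than extending it.

```shell
# Write the given words to a file, one per line (hypothetical helper).
make_stoplist() {
  out="$1"
  shift
  printf '%s\n' "$@" > "$out"
}

# Only run the import when we are inside the mallet-2.0.8 folder.
if [ -x bin/mallet ]; then
  mkdir -p ../output
  make_stoplist ../output/my_stopwords.txt thee thou thy hath
  bin/mallet import-dir --input sample-data/web/en \
    --output ../output/corpus_custom_stop.mallet \
    --keep-sequence --stoplist-file ../output/my_stopwords.txt
fi
```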

Or maybe you want to take into account n-grams of a certain size. You can specify this here:

`--gram-sizes INTEGER,[INTEGER,...]
  Include among the features all n-grams of sizes specified.  For example, to get all unigrams and bigrams, use --gram-sizes 1,2.  This option occurs after the removal of stop words, if removed.`

Just play around with some of the options!