# Tutorial for using the fancy_name_toolbox_for_mighty_topic_Modelling


The following tutorial explains how to use the Python library "cophi_toolbox". If you have not done so yet, please follow the instructions for installing Jupyter and all Python libraries mentioned in readme.txt/installation_instructions.

The toolbox provides two different approaches to preprocessing, both of which lead to tokenised text. After that you can use one of two topic modelling algorithms to create topic models for your corpus. The last step is visualising the results.
In topic modelling, preprocessing the data that your topic modelling algorithm will handle is a major concern. Different kinds of text react differently to certain preprocessing steps, and adapting the preprocessing is crucial for a better topic model. Therefore, the cophi_toolbox can take either plain text (Step 1-?) or an already NLP-enhanced csv file (Step ? to ?) as input. For now, the csv input is specified to use dkproWrapper output. For more information follow this link: https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper/tree/master/doc

## Step 1: Importing packages
First, we get access to the functionality of the toolbox by importing its submodules. To call their functions, we use the aliases assigned on import (pre, visual and mal).

In [2]:
from dariah_topics import preprocessing as pre
from dariah_topics import visualization as visual
from dariah_topics import mallet as mal


## Step 2: Loading txt-Files
We start with plain text files. If you want to use NLP-annotated data (csv files), skip to Step ?

With the second step we load the plain text corpus into memory for further preprocessing. The tutorial includes an example set of data. Should you want to use your own corpus, change the path accordingly.

In [None]:
#creates a list of documents in the directory (in this case 'corpus_txt', a sample text corpus)

path_txt = "corpus_txt"

doclist_txt = pre.create_document_list(path_txt)

doclist_txt[:5]

In [None]:
#creates a list of filenames used for the datamodel and the visualisation
doc_labels = list(pre.get_labels(doclist_txt))
doc_labels[:5]

In [None]:
#reads/yields the text files of the corpus

corpus_txt = pre.read_from_txt(doclist_txt)
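
The comment above says the reader "reads/yields" the files. If it is indeed a generator, the corpus can be iterated only once, which is easy to overlook. The following is a minimal sketch of that behaviour using only the standard library. The helpers `list_documents` and `read_documents` are hypothetical stand-ins, not the toolbox's own functions, and the two tiny files are made up for illustration.

```python
import tempfile
from pathlib import Path

def list_documents(folder, suffix="txt"):
    """Return the sorted paths of all files with the given suffix (hypothetical helper)."""
    return sorted(Path(folder).glob("*." + suffix))

def read_documents(paths):
    """Yield the content of each file, one document at a time (hypothetical helper)."""
    for path in paths:
        yield path.read_text(encoding="utf-8")

with tempfile.TemporaryDirectory() as folder:
    # Two tiny made-up files stand in for a real corpus.
    Path(folder, "a.txt").write_text("first document", encoding="utf-8")
    Path(folder, "b.txt").write_text("second document", encoding="utf-8")

    paths = list_documents(folder)
    labels = [path.stem for path in paths]  # file names without extension
    corpus = read_documents(paths)
    texts = list(corpus)      # the first pass consumes the generator
    leftover = list(corpus)   # a second pass yields nothing

print(labels, texts, leftover)
```

If you need the texts more than once, store them in a list first or create the generator again.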

## Step 3: Tokenising
In this step we tokenise the text (into words), as topic modelling algorithms usually work on a bag-of-words model. The tokenise function the toolbox provides is a simple Unicode tokeniser. Depending on the corpus, it might be useful to import your own tokeniser, since how well a tokeniser works varies with language, epoch etc.
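
To give an idea of what a simple Unicode tokeniser can look like, here is a minimal sketch based on a regular expression. It is not the toolbox's implementation. The name `simple_tokenize`, the pattern and the lowercasing are assumptions made for illustration only.

```python
import re

# Hypothetical stand-in for a simple tokeniser: in Python 3, \w+ matches
# runs of Unicode letters, digits and underscores.
TOKEN_PATTERN = re.compile(r"\w+")

def simple_tokenize(text):
    """Yield lowercased word tokens from a string."""
    for match in TOKEN_PATTERN.finditer(text):
        yield match.group().lower()

tokens = list(simple_tokenize("Über die Brücke ging er, langsam."))
print(tokens)
```

A tokeniser of your own only needs to turn a string into an iterable of tokens to be used in the same list comprehension as in the cell below.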


In [None]:
#tokenizes the corpus and returns a list of tokens of one text as an example

doc_tokens = [list(pre.tokenize(txt)) for txt in list(corpus_txt)]
doc_tokens[:1]

## Step 3.5: Working with dkproWrapper output
As mentioned above, you need to run the code listed in this step to incorporate csv output created by dkproWrapper.
The output is already tokenised. Therefore it is unnecessary to go through Step 3 "Tokenising".
As in Step 2, we need a list of document names.
By default, the function provided reads only the most relevant information from the csv file. You can pass an additional parameter with a list of column names (to read_from_csv; has to be of type 'list') to specify the information you retrieve.
By default, filter_csv selects lemmas with the POS tags for adjective, verb and noun. By passing an argument (to filter_csv; has to be of type 'list') you can specify which POS tags to keep.
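
The idea of filtering lemmas by POS tag can be sketched with the standard library alone. The example below does not use the toolbox. The function `filter_lemmas`, the tab-separated layout, the sample rows and the tag values ("ADJ", "V", "NN") are assumptions for illustration, and only the column names are taken from this tutorial.

```python
import csv
import io

# Made-up sample in a tab-separated layout; the tag values are placeholders.
SAMPLE = (
    "ParagraphId\tTokenId\tLemma\tCPOS\tNamedEntity\n"
    "0\t0\tthe\tART\t_\n"
    "0\t1\told\tADJ\t_\n"
    "0\t2\thouse\tNN\t_\n"
    "0\t3\tcollapse\tV\t_\n"
    "0\t4\t.\tPUNC\t_\n"
)

def filter_lemmas(handle, pos_tags=("ADJ", "V", "NN")):
    """Return the lemmas whose coarse POS tag is in pos_tags (hypothetical helper)."""
    wanted = set(pos_tags)
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Lemma"] for row in reader if row["CPOS"] in wanted]

lemmas = filter_lemmas(io.StringIO(SAMPLE))
nouns = filter_lemmas(io.StringIO(SAMPLE), pos_tags=["NN"])
print(lemmas, nouns)
```

Applying such a filter to every document gives one list of lemmas per document, which is the same shape as the tokenised plain text from Step 3.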

In [None]:
#creates a list of documents in the directory (in this case 'corpus_csv', a sample dkproWrapper output)
path_csv = "corpus_csv"

doclist_csv = pre.create_document_list(path_csv, 'csv')
doclist_csv[:5]

In [None]:
#creates a list of filenames used for the datamodel and the visualisation
doc_labels = list(pre.get_labels(doclist_csv))
doc_labels[:5]

In [None]:
#reads/yields the text files of the corpus. Only reads columns 'ParagraphId', 'TokenId', 'Lemma', 'CPOS', 'NamedEntity'
corpus_csv = pre.read_from_csv(doclist_csv)

# needs editing

In [None]:
#filters the dkproWrapper output to only add the lemmas of certain POS-Tags. (adjectives, verbs and nouns by default)
#TODO: transform the output of filter_csv into a list of lists and assign it to doc_tokens

## Step 4: Transforming the data
With the code in the following sections the text corpus is transformed into a data structure similar to the one used in Gensim. This new data model allows more efficient processing.

In [None]:
#transforms the corpus into a Matrix Market representation
id_types, doc_ids = pre.create_dictionaries(doc_labels, doc_tokens)
sparse_bow = pre.create_mm(doc_labels, doc_tokens, id_types, doc_ids)

When you look at the data structure by running the section below, you can see that each document and each word type of the corpus is represented by a number (columns doc_id and token_id). In turn, each token_id is assigned a word count in the adjacent column, showing how often the word with that token_id occurs in the respective document (for example, document no. 1).
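
To make the layout more concrete, the sketch below builds a comparable structure by hand from two made-up documents. It uses a plain dictionary keyed by (doc_id, token_id) instead of the toolbox's data model. The names `type_ids` and `bow` and the choice to start ids at 1 are assumptions for illustration.

```python
from collections import Counter

# Made-up example data.
doc_labels = ["doc_a", "doc_b"]
doc_tokens = [["the", "cat", "the", "dog"], ["the", "bird"]]

# Assign a number to every document and every word type (ids start at 1 here).
doc_ids = {label: i for i, label in enumerate(doc_labels, 1)}
types = sorted({token for tokens in doc_tokens for token in tokens})
type_ids = {word_type: i for i, word_type in enumerate(types, 1)}

# Count how often each word type occurs in each document.
bow = {}
for label, tokens in zip(doc_labels, doc_tokens):
    for token, count in Counter(tokens).items():
        bow[(doc_ids[label], type_ids[token])] = count

# One row per (doc_id, token_id) pair with its count.
for (doc_id, token_id), count in sorted(bow.items()):
    print(doc_id, token_id, count)
```

Only combinations that actually occur are stored, which is what makes the representation sparse.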

In [None]:
#shows the datamodel
sparse_bow

## Step 5: Removing features
In this step we show several ways to populate a list of words that we want to ignore in the topic modelling step. One option is to remove the most frequent words, as these are usually function words with little semantic meaning. Another is to remove hapax legomena (words occurring only once), which are usually considered noise in a topic model. The third is to simply use a stopword list.
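
The first two strategies can be sketched with a simple frequency count over a made-up corpus. The helpers `most_frequent` and `hapax_legomena` are hypothetical and work on plain token lists. They are not the toolbox functions used in the cells below, which operate on the data model from Step 4.

```python
from collections import Counter

# Made-up example data.
doc_tokens = [
    ["the", "cat", "and", "the", "dog"],
    ["the", "bird", "and", "the", "cat"],
]

# Corpus-wide frequency of every word type.
corpus_counts = Counter(token for tokens in doc_tokens for token in tokens)

def most_frequent(counts, n):
    """Return the n most frequent word types (hypothetical helper)."""
    return {word for word, _ in counts.most_common(n)}

def hapax_legomena(counts):
    """Return the word types that occur exactly once in the corpus (hypothetical helper)."""
    return {word for word, count in counts.items() if count == 1}

frequent = most_frequent(corpus_counts, 1)
hapax = hapax_legomena(corpus_counts)

# Combine both sets and drop their members from every document.
ignore = frequent | hapax
cleaned = [[token for token in tokens if token not in ignore] for tokens in doc_tokens]
print(frequent, hapax, cleaned)
```

How many of the most frequent words to drop depends on the corpus, so it is worth inspecting the list before removing anything.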

In [None]:
#reads a txt file (one word per line) and creates a list
import os.path
basepath = os.path.abspath('.')

with open(os.path.join(basepath, "tutorial_supplementals", "stopwords", "en"), 'r', encoding = 'utf-8') as f:
    #splitlines() avoids an empty entry caused by a trailing newline
    stopword_list = set(f.read().splitlines())

In [None]:
#removes stopwords
sparse_df_stopwords_removed = pre.remove_features(sparse_bow, id_types, stopword_list)

In [None]:
#creates a stopword list consisting of the 100 most frequent words
stopword_list = pre.find_stopwords(sparse_bow, id_types, 100)
len(stopword_list)

In [None]:
#creates a list of words only occurring once per document
hapax_list = pre.find_hapax(sparse_bow, id_types)
len(hapax_list)

In [None]:
#combines the list of most frequent words and the list of words occurring only once into one list
#cleans the corpus of the combined list
feature_list = set(stopword_list).union(hapax_list)
clean_term_frequency = pre.remove_features(sparse_bow, id_types, feature_list)

## Step 6: Topic modelling with Gensim

The following lines of code transform the data structure we created earlier into the format used by Gensim and save it to disk. Following this code block we create the topic model. Depending on the size of the corpus used for it, lean back and get a coffee. Or a pot. A big pot. Or go home and get some sleep :)
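
The on-disk format in question is the Matrix Market coordinate format: a banner line, a header line with the number of rows (documents), the number of columns (word types) and the number of stored entries, followed by one "row column value" line per entry. The sketch below writes this layout for a small made-up bag-of-words dictionary. The function `write_matrix_market` and the dictionary layout are assumptions for illustration, not part of the toolbox.

```python
import io

def write_matrix_market(bow, handle):
    """Write a {(doc_id, token_id): count} mapping in Matrix Market coordinate format."""
    num_docs = max(doc_id for doc_id, _ in bow)
    num_types = max(token_id for _, token_id in bow)
    handle.write("%%MatrixMarket matrix coordinate real general\n")
    # Header: rows, columns, number of stored entries.
    handle.write(f"{num_docs} {num_types} {len(bow)}\n")
    for (doc_id, token_id), count in sorted(bow.items()):
        handle.write(f"{doc_id} {token_id} {count}\n")

# Made-up example data with 1-based ids.
bow = {(1, 2): 1, (1, 4): 2, (2, 1): 1}

buffer = io.StringIO()
write_matrix_market(bow, buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Writing to an in-memory buffer keeps the example self-contained. Passing an open file handle instead would write the same lines to disk.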

In [None]:
#imports gensim functions
from gensim.models import LdaModel
from gensim.corpora import MmCorpus

In [None]:
#derives the Matrix Market header values from the data model
num_docs = max(sparse_bow.index.get_level_values("doc_id"))
num_types = max(sparse_bow.index.get_level_values("token_id"))
#the third header value is the number of stored entries (rows of the data model), not the sum of all counts
num_entries = len(sparse_bow)

header_string = str(num_docs) + " " + str(num_types) + " " + str(num_entries) + "\n"

#opening in 'w' mode overwrites an existing file, so no separate truncation step is needed
with open("gb_plain.mm", 'w', encoding = "utf-8") as f:
    f.write("%%MatrixMarket matrix coordinate real general\n")
    f.write(header_string)
    sparse_bow.to_csv(f, sep = ' ', header = None)

In [None]:
#loads the corpus from disk, inverts the dictionaries so that ids are the keys, and trains the model
mm = MmCorpus("gb_plain.mm")
doc2id = {value : key for key, value in doc_ids.items()}
type2id = {value : key for key, value in id_types.items()}

model = LdaModel(corpus=mm, id2word=type2id, num_topics=60, alpha = "symmetric", passes = 10)

topic_nr_x = model.get_topic_terms(10)

[type2id[i[0]] for i in topic_nr_x]

topics = model.show_topics(num_topics = 60)

topics

## Step 6.5: Topic modelling with mallet
Another algorithm for topic modelling is implemented in the Java-based software mallet. For this to work you need to download and install mallet from http://mallet.cs.umass.edu/download.php.
Mallet uses plain text as input, so none of the preprocessing of this package is available for mallet topic modelling as of yet.
To use mallet for topic modelling via this Python script, you need to pass the paths to the mallet binary, the input and the output. Again, depending on the size of your corpus, after calling mal.create_MalletMatrix() you should get coffee... or tea, a lot of tea, a huge pot, believe me, it's great, it's fantastic.
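
For orientation, a mallet run on the command line typically consists of two calls: `import-dir` converts a folder of plain text files into mallet's own format, and `train-topics` fits the model. The sketch below only assembles these two commands as argument lists and does not execute anything. The function `build_mallet_commands`, the file names and the default of 20 topics are assumptions for illustration, and how the toolbox's mallet module calls the binary internally may differ.

```python
import os

def build_mallet_commands(mallet_bin, corpus_dir, output_dir, num_topics=20):
    """Return the argument lists for a mallet import and training run (hypothetical helper)."""
    corpus_file = os.path.join(output_dir, "corpus.mallet")
    doc_topics = os.path.join(output_dir, "doc_topics.txt")
    topic_keys = os.path.join(output_dir, "topic_keys.txt")
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", corpus_dir,
        "--output", corpus_file,
        "--keep-sequence",
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", corpus_file,
        "--num-topics", str(num_topics),
        "--output-doc-topics", doc_topics,
        "--output-topic-keys", topic_keys,
    ]
    return import_cmd, train_cmd

# "insert_path_here" is the same placeholder as in the cell below.
import_cmd, train_cmd = build_mallet_commands("insert_path_here", "corpus_txt", "mallet_output")
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

To actually run the commands, each list could be passed to `subprocess.run(cmd, check=True)` once the placeholder has been replaced with the real path to the mallet binary.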

In [None]:
#sets path to corpus and mallet binary and creates an output folder
path_to_corpus = os.path.join(os.path.abspath('.'), 'corpus_txt')
path_to_mallet = "insert_path_here"
malletBinary = mal.create_mallet_binary(path_to_corpus, path_to_mallet)

In [None]:
#sets the path to the output folder and works its magic
basepath = os.path.join(os.path.abspath('.'), "tutorial_supplementals/mallet_output")
doc_topics = os.path.join(basepath, "doc_topics.txt")


mal.create_MalletMatrix(doc_topics)

'''
         ,/   *
      _,'/_   |
      `(")' ,'/
   _ _,-H-./ /
   \_\_\.   /
     )" |  (
  __/   H   \__
  \    /|\    /
   `--'|||`--'
      ==^==

'''
