# Testing `collection.py`
The following tutorial shows how to use the `collection` module of DARIAH-Topics.

## 1. Preparation
First, import the `collection` module so that your IPython notebook has access to its functions and classes. Next, set the paths to two test corpora: one consisting of plain text files, and one consisting of annotated text that was preprocessed with several NLP tools and stored as CSV files (for details on the format, see the [DARIAH-DKPro-Wrapper tutorial](https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper/blob/master/doc/tutorial.adoc)).

In [2]:
import collection

In [3]:
path_txt = "corpus_txt"
path_csv = "corpus_csv"

## 2. Creating a list of filenames (plain text and CSV files)
`create_document_list()` collects the files in a folder and returns their paths in a normalized form, so that non-uniform text files can be processed by the module. An optional second argument `ext` specifies the file extension; as the log output below shows, TXT files are listed when it is omitted.


In [4]:
doclist_txt = collection.create_document_list(path_txt)
doclist_txt[:5]

20-Nov-2016 18:05:47 INFO collection: Creating document list from TXT files ...
20-Nov-2016 18:05:47 DEBUG collection: 17 entries in document list.


['corpus_txt/Doyle_AScandalinBohemia.txt',
 'corpus_txt/Doyle_AStudyinScarlet.txt',
 'corpus_txt/Doyle_TheHoundoftheBaskervilles.txt',
 'corpus_txt/Doyle_TheSignoftheFour.txt',
 'corpus_txt/Howard_GodsoftheNorth.txt']

In [5]:
doclist_csv = collection.create_document_list(path_csv, 'csv')
doclist_csv[:5]

20-Nov-2016 18:05:48 INFO collection: Creating document list from CSV files ...
20-Nov-2016 18:05:48 DEBUG collection: 16 entries in document list.


['corpus_csv/Doyle_AStudyinScarlet.txt.csv',
 'corpus_csv/Doyle_TheHoundoftheBaskervilles.txt.csv',
 'corpus_csv/Doyle_TheSignoftheFour.txt.csv',
 'corpus_csv/Howard_GodsoftheNorth.txt.csv',
 'corpus_csv/Howard_SchadowsinZamboula.txt.csv']
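
For readers without the module at hand, the listing step can be approximated with the standard library. This is only a minimal sketch, not the implementation of `create_document_list()`; the helper name `list_documents` and the temporary dummy files are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def list_documents(folder, ext="txt"):
    """Return the sorted paths of all files in `folder` ending in `.ext`."""
    return sorted(str(p) for p in Path(folder).glob("*." + ext))


# Build a throwaway corpus folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt", "a.txt.csv"):
        Path(tmp, name).write_text("dummy", encoding="utf-8")
    txt_files = [Path(p).name for p in list_documents(tmp)]
    csv_files = [Path(p).name for p in list_documents(tmp, "csv")]

print(txt_files)
print(csv_files)
```

Sorting the result keeps the document order stable between runs, which matters once labels and model output have to be matched to files.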

## 3. Getting document labels

In [6]:
labels = collection.get_labels(doclist_txt)
list(labels)[:5]

20-Nov-2016 18:05:53 INFO collection: Creating document labels ...
20-Nov-2016 18:05:53 DEBUG collection: Document labels available.


['Doyle_AScandalinBohemia.txt',
 'Doyle_AStudyinScarlet.txt',
 'Doyle_TheHoundoftheBaskervilles.txt',
 'Doyle_TheSignoftheFour.txt',
 'Howard_GodsoftheNorth.txt']
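
The labels shown above are the file names without their folder. A comparable result can be sketched with `os.path.basename`; the function name `labels_from_paths` is hypothetical, and whether `get_labels()` works this way internally is an assumption.

```python
import os


def labels_from_paths(paths):
    """Yield the last path component (the file name) of each path."""
    for path in paths:
        yield os.path.basename(path)


paths = [
    "corpus_txt/Doyle_AScandalinBohemia.txt",
    "corpus_txt/Howard_GodsoftheNorth.txt",
]
example_labels = list(labels_from_paths(paths))
print(example_labels)
```

Like the `labels` object in the cell above, the generator has to be wrapped in `list()` before it can be sliced.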

## 4. Load the corpora
The `read_from_txt()` and `read_from_csv()` functions each return a generator object, which provides a memory-efficient way to handle larger corpora.

In [7]:
corpus_txt = collection.read_from_txt(doclist_txt)

In [8]:
corpus_csv = collection.read_from_csv(doclist_csv)
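
To illustrate why a generator saves memory, the following sketch reads one file at a time instead of loading the whole corpus. It is not the module's code; `read_lazily` and the temporary files are assumptions for the sake of a runnable example.

```python
import tempfile
from pathlib import Path


def read_lazily(paths):
    """Yield the content of one file at a time."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            yield handle.read()


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        path = Path(tmp, f"doc{i}.txt")
        path.write_text(f"text {i}", encoding="utf-8")
        paths.append(path)

    docs = read_lazily(paths)
    first = next(docs)   # only the first file has been read so far
    rest = list(docs)    # reads the remaining files

print(first, rest)
```

A generator can be iterated only once. If a corpus generator has already been partly or fully consumed, create it again before passing it to another function.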

## 5. Segmenting text
An important preprocessing step in topic modeling is segmenting the texts into 'chunks'. The function takes the target corpus and the size of the chunks as arguments (the log output below reports the segment length in characters). Depending on the language and type of text, results can vary widely in quality.

In [9]:
segments = collection.segmenter(corpus_txt, 1000)
next(segments)

20-Nov-2016 18:06:04 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:04 INFO collection: Segmenting document ...
20-Nov-2016 18:06:04 DEBUG collection: Segment has a length of 1000 characters.


"A SCANDAL IN BOHEMIA\n\nA. CONAN DOYLE\n\n\nI\n\nTo Sherlock Holmes she is always _the_ woman. I have seldom heard him\nmention her under any other name. In his eyes she eclipses and\npredominates the whole of her sex. It was not that he felt any emotion\nakin to love for Irene Adler. All emotions, and that one particularly,\nwere abhorrent to his cold, precise but admirably balanced mind. He was,\nI take it, the most perfect reasoning and observing machine that the\nworld has seen; but as a lover, he would have placed himself in a false\nposition. He never spoke of the softer passions, save with a gibe and a\nsneer. They were admirable things for the observer--excellent for\ndrawing the veil from men's motives and actions. But for the trained\nreasoner to admit such intrusions into his own delicate and finely\nadjusted temperament was to introduce a distracting factor which might\nthrow a doubt upon all his mental results. Grit in a sensitive\ninstrument, or a crack in one of his own

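The idea of fixed-size segmentation can be sketched in a few lines. Because the log above reports the segment length in characters, this sketch slices by characters; whether `segmenter()` additionally respects word boundaries is not shown in the output, so treat the function `segment_by_chars` as a hypothetical simplification.

```python
def segment_by_chars(documents, size):
    """Yield consecutive slices of at most `size` characters per document."""
    for text in documents:
        for start in range(0, len(text), size):
            yield text[start:start + size]


docs = iter(["abcdefghij", "xyz"])
chunks = list(segment_by_chars(docs, 4))
print(chunks)
```

The last chunk of each document may be shorter than `size`, and chunks never span two documents.
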
## 6. Filtering text using POS-Tags
Another way to preprocess the text is to filter it by POS tags and work with lemmas (in this case, only adjectives, verbs and nouns can be selected). The annotated CSV files provided in this example already contain this information.


In [10]:
lemmas = collection.filter_POS_tags(corpus_csv)
next(lemmas)[:5]

20-Nov-2016 18:06:12 INFO collection: Accessing CSV documents ...
20-Nov-2016 18:06:12 INFO collection: Accessing ['ADJ', 'V', 'NN'] lemmas ...


37    typographical
56          textual
59           square
72              old
75             such
Name: Lemma, dtype: object
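
The filtering step can be imitated with the standard `csv` module. The column name `Lemma` is taken from the output above and the tags from the log line; the column name `CPOS`, the tab delimiter and the sample rows are assumptions, so check them against your own files.

```python
import csv
import io

# Tiny made-up sample in an assumed tab-separated layout.
SAMPLE = (
    "Token\tLemma\tCPOS\n"
    "The\tthe\tART\n"
    "old\told\tADJ\n"
    "houses\thouse\tNN\n"
    "stood\tstand\tV\n"
)


def filter_lemmas(handle, pos_tags=("ADJ", "V", "NN")):
    """Return the lemmas of all rows whose POS tag is in `pos_tags`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Lemma"] for row in reader if row["CPOS"] in pos_tags]


lemmas_example = filter_lemmas(io.StringIO(SAMPLE))
print(lemmas_example)
```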

## 7. Creating a counter to use `find_stopwords()` and `find_hapax()`

In [11]:
counter = collection.calculate_term_frequency(corpus_txt)

20-Nov-2016 18:06:20 INFO collection: Calculating term frequency ...
20-Nov-2016 18:06:20 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:20 DEBUG collection: Term frequency calculated.
20-Nov-2016 18:06:20 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:20 DEBUG collection: Term frequency calculated.
20-Nov-2016 18:06:20 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:20 DEBUG collection: Term frequency calculated.
20-Nov-2016 18:06:20 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:20 DEBUG collection: Term frequency calculated.
20-Nov-2016 18:06:20 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:20 DEBUG collection: Term frequency calculated.
20-Nov-2016 18:06:20 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:20 DEBUG collection: Term frequency calculated.
20-Nov-2016 18:06:20 DEBUG collection: Accessing TXT document ...
20-Nov-2016 18:06:20 DEBUG collection: Term frequency calculated.
20-Nov-

In [12]:
#
# Copy of the original counter, used to reset the counter during testing.
# This way the counter does not have to be rebuilt every time;
# it can simply be reset here.
#

countercopy = counter.copy()
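
A term-frequency counter of this kind can be sketched with `collections.Counter`. The whitespace tokenization below is an assumption, although it is consistent with the results further down, where `It` and `it` are counted separately and punctuation stays attached to words. The function name `term_frequency` is hypothetical.

```python
from collections import Counter


def term_frequency(documents):
    """Count whitespace-separated tokens over all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    return counts


tf = term_frequency(["the cat and the dog", "The dog"])
backup = tf.copy()   # independent copy, as with `countercopy` above
del tf["the"]        # changing one counter leaves the copy untouched

print(backup["the"], "the" in tf)
```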

## 8. Find stopwords

In [13]:
stopwords = collection.find_stopwords(countercopy, 50)
stopwords[:5]

20-Nov-2016 18:06:30 INFO collection: Finding stopwords ...
20-Nov-2016 18:06:30 DEBUG collection: 50 stopwords found.


Unnamed: 0,word,freq
0,the,21357
1,of,11614
2,and,11040
3,to,8516
4,a,7652
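
Treating the most frequent words as stopwords corresponds to `Counter.most_common(n)`. The following lines are a sketch with made-up counts, not the code behind `find_stopwords()`.

```python
from collections import Counter

counts = Counter({"the": 5, "of": 3, "whale": 1, "sea": 2})

# The n most frequent words, highest count first.
top_two = [word for word, _ in counts.most_common(2)]
print(top_two)
```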


## 9. Find hapax legomena

In [14]:
hapax = collection.find_hapax(countercopy)
hapax[:5]

20-Nov-2016 18:06:35 INFO collection: Find hapax legomena ...
20-Nov-2016 18:06:36 DEBUG collection: 26096 hapax legomena found.


Unnamed: 0,word,freq
0,more!);,1
1,AVENGING,1
2,public's,1
3,surge.,1
4,mainland!--Come!',1
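
Hapax legomena are the words with a frequency of exactly one. A minimal sketch with a made-up sentence, independent of how `find_hapax()` is implemented:

```python
from collections import Counter

counts = Counter("a rose is a rose is a flower".split())

# Keep only the words that were counted exactly once.
hapax_example = sorted(word for word, freq in counts.items() if freq == 1)
print(hapax_example)
```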


## 10. Remove stopwords

In [15]:
dict_without_stopwords = collection.remove_filtered_words(countercopy, stopwords)

20-Nov-2016 18:06:42 INFO collection: Removing stopwords ...
20-Nov-2016 18:06:42 DEBUG collection: 50 words removed.


In [16]:
print("Length of the counter: ", len(counter))
print("Length of the dict after removing stopwords: ", len(dict_without_stopwords))
print("25 MFWs of the counter:\n", counter.most_common(25), "\n")
print("25 MFWs after removing stopwords:\n", dict_without_stopwords.most_common(25))
print("Number of words that occur exactly once: ",len([count for count in dict_without_stopwords.values() if count == 1]))

Length of the counter:  43989
Length of the dict after removing stopwords:  43939
25 MFWs of the counter:
 [('the', 21357), ('of', 11614), ('and', 11040), ('to', 8516), ('a', 7652), ('in', 5585), ('I', 5393), ('that', 4479), ('was', 4211), ('he', 3610), ('his', 3503), ('had', 2816), ('is', 2768), ('with', 2767), ('as', 2733), ('it', 2457), ('for', 2282), ('at', 2272), ('have', 2171), ('which', 2063), ('we', 1963), ('you', 1905), ('not', 1902), ('my', 1833), ('be', 1738)] 

25 MFWs after removing stopwords:
 [('up', 871), ('into', 860), ('so', 849), ('will', 841), ('there', 819), ('their', 802), ('It', 789), ('when', 725), ('very', 694), ('more', 680), ('some', 666), ('she', 652), ('if', 631), ('has', 628), ('who', 623), ('about', 622), ('down', 613), ('what', 608), ('We', 601), ('than', 585), ('any', 584), ('us', 565), ('your', 561), ('over', 540), ('like', 528)]
Number of words that occur exactly once:  26096


## 11. Remove hapax legomena

In [17]:
dict_without_hapax = collection.remove_filtered_words(countercopy, hapax)

20-Nov-2016 18:06:57 INFO collection: Removing stopwords ...
20-Nov-2016 18:06:57 DEBUG collection: 26096 words removed.


In [18]:
print("Length of the counter: ", len(counter))
print("Length of the dict after removing hapax legomena: ",len(dict_without_hapax))
print("Number of words that occur more than once: ", len([count for count in dict_without_hapax.values() if count > 1]))
print("Number of words that occur exactly once: ",len([count for count in dict_without_hapax.values() if count == 1]))

Length of the counter:  43989
Length of the dict after removing hapax legomena:  17843
Number of words that occur more than once:  17843
Number of words that occur exactly once:  0
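
The figures above add up as 43989 - 50 - 26096 = 17843. This suggests that the hapax step ran on a counter from which the 50 stopwords had already been removed, in other words that `remove_filtered_words()` modified `countercopy` in place. That reading is inferred from the printed numbers only. The sketch below reproduces the effect with a hypothetical `remove_words` helper and made-up counts.

```python
from collections import Counter


def remove_words(counts, words):
    """Delete `words` from `counts` in place and return the same object."""
    for word in words:
        counts.pop(word, None)
    return counts


original = Counter({"the": 9, "of": 7, "ship": 2, "reef": 1})

working = original.copy()
no_stop = remove_words(working, ["the", "of"])
no_hapax = remove_words(working, ["reef"])   # stopwords are already gone

# Starting from a fresh copy removes the hapax legomena only.
fresh = remove_words(original.copy(), ["reef"])

print(len(original), len(no_hapax), len(fresh))
```

If the two filters should be evaluated independently, pass a fresh `counter.copy()` to each call.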


## 12. Visualization
Simple getter functions are implemented for visualization tasks. In this case, the `get_labels` function extracts the titles of the corpus files loaded above.

In [19]:
lda_model = 'out_easy/corpus.lda'
corpus = 'out_easy/corpus.mm'
dictionary = 'out_easy/corpus.dict'
doc_labels = 'out_easy/corpus_doclabels.txt'
interactive = False

vis = collection.Visualization(lda_model, corpus, dictionary, doc_labels, interactive)

20-Nov-2016 18:07:07 INFO collection: Accessing corpus ...
20-Nov-2016 18:07:07 INFO gensim.corpora.indexedcorpus: loaded corpus index from out_easy/corpus.mm.index
20-Nov-2016 18:07:07 INFO gensim.matutils: initializing corpus reader from out_easy/corpus.mm
20-Nov-2016 18:07:07 INFO gensim.matutils: accepted corpus with 17 documents, 514 features, 4585 non-zero entries
20-Nov-2016 18:07:07 DEBUG collection: Corpus available.
20-Nov-2016 18:07:07 INFO collection: Accessing model ...
20-Nov-2016 18:07:07 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda
20-Nov-2016 18:07:07 INFO gensim.utils: loading id2word recursively from out_easy/corpus.lda.id2word.* with mmap=None
20-Nov-2016 18:07:07 INFO gensim.utils: setting ignored attribute state to None
20-Nov-2016 18:07:07 INFO gensim.utils: setting ignored attribute dispatcher to None
20-Nov-2016 18:07:07 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda.state
20-Nov-2016 18:07:07 DEBUG collection: Model 

It is not possible to run `save_heatmap()` before `make_heatmap()`.

In [20]:
vis.save_heatmap("./visualizations/heatmap")

20-Nov-2016 18:07:10 INFO collection: Saving heatmap figure...
20-Nov-2016 18:07:10 ERROR collection: Run make_heatmap() before save_heatmp()


AttributeError: 'Visualization' object has no attribute 'heatmap_vis'

In [21]:
vis.make_heatmap()

20-Nov-2016 18:07:15 INFO collection: Accessing topic distribution ...
20-Nov-2016 18:07:15 DEBUG collection: Topic distribution available.
20-Nov-2016 18:07:15 INFO collection: Accessing topic probability ...
20-Nov-2016 18:07:15 DEBUG collection: Topic probability available.
20-Nov-2016 18:07:15 INFO collection: Accessing plot labels ...
20-Nov-2016 18:07:15 DEBUG collection: 10 plot labels available.
20-Nov-2016 18:07:15 INFO collection: Creating heatmap figure ...
20-Nov-2016 18:07:15 DEBUG collection: Heatmap figure available.


In [22]:
vis.save_heatmap("./visualizations/heatmap")

20-Nov-2016 18:07:16 INFO collection: Saving heatmap figure...
20-Nov-2016 18:07:16 DEBUG collection: Heatmap figure available at ./visualizations/heatmap/heatmap.png
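
The `AttributeError` shown earlier is typical of a two-step design in which the "make" method stores the figure on the object and the "save" method reads it back. The class below is a hypothetical stand-in that mimics this behaviour; it is not the `Visualization` class, and its names are assumptions.

```python
class Figure:
    """Hypothetical object that must build a figure before saving it."""

    def make(self):
        # The attribute only exists after make() has been called.
        self.figure = "rendered heatmap"

    def save(self, path):
        try:
            figure = self.figure
        except AttributeError:
            raise AttributeError("Run make() before save()") from None
        return f"{path}/heatmap.png <- {figure}"


fig = Figure()
try:
    fig.save("out")
    failed = False
except AttributeError:
    failed = True   # save() before make() is rejected

fig.make()
saved = fig.save("out")
print(failed, saved)
```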


In [23]:
vis = collection.Visualization(lda_model, corpus, dictionary, doc_labels, interactive=True)

20-Nov-2016 18:07:17 INFO collection: Accessing corpus ...
20-Nov-2016 18:07:17 INFO gensim.corpora.indexedcorpus: loaded corpus index from out_easy/corpus.mm.index
20-Nov-2016 18:07:17 INFO gensim.matutils: initializing corpus reader from out_easy/corpus.mm
20-Nov-2016 18:07:17 INFO gensim.matutils: accepted corpus with 17 documents, 514 features, 4585 non-zero entries
20-Nov-2016 18:07:17 DEBUG collection: Corpus available.
20-Nov-2016 18:07:17 INFO collection: Accessing model ...
20-Nov-2016 18:07:17 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda
20-Nov-2016 18:07:17 INFO gensim.utils: loading id2word recursively from out_easy/corpus.lda.id2word.* with mmap=None
20-Nov-2016 18:07:17 INFO gensim.utils: setting ignored attribute state to None
20-Nov-2016 18:07:17 INFO gensim.utils: setting ignored attribute dispatcher to None
20-Nov-2016 18:07:17 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda.state
20-Nov-2016 18:07:17 DEBUG collection: Model 

It is likewise not possible to run `save_interactive()` before `make_interactive()`.

In [24]:
vis.save_interactive("./visualizations/interactive")

20-Nov-2016 18:07:18 INFO collection: Saving interactive visualization ...
20-Nov-2016 18:07:18 ERROR collection: Run make_interactive() before save_interactive()


AttributeError: 'Visualization' object has no attribute 'interactive_vis'

In [25]:
vis.make_interactive()

20-Nov-2016 18:07:19 INFO collection: Accessing model, corpus and dictionary ...
20-Nov-2016 18:07:19 DEBUG gensim.models.ldamodel: performing inference on a chunk of 17 documents
20-Nov-2016 18:07:19 DEBUG gensim.models.ldamodel: 6/17 documents converged within 50 iterations
20-Nov-2016 18:07:20 DEBUG collection: Interactive visualization available.


In [26]:
vis.save_interactive("./visualizations/interactive")

20-Nov-2016 18:07:20 INFO collection: Saving interactive visualization ...
20-Nov-2016 18:07:20 DEBUG collection: Interactive visualization available at ./visualizations/interactive/corpus_interactive.html and ./visualizations/interactive/corpus_interactive.json


![success](http://cdn2.hubspot.net/hub/128506/file-446943132-jpg/images/computer_woman_success.jpg)