# Testing `collection.py`
The following tutorial shows how to use the `collection` module of DARIAH-Topics.

## 1. Preparation
First, import the `collection` module so that your IPython notebook has access to its functions and classes. Then set the paths to two test corpora: one consisting of plain text files, and one consisting of annotated text in CSV format, preprocessed with several NLP tools (for details on the format, see the [DARIAH-DKPro-Wrapper tutorial](https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper/blob/master/doc/tutorial.adoc)).

In [2]:
import collection

In [3]:
path_txt = "corpus_txt"
path_csv = "corpus_csv"

## 2. Creating a list of filenames (plain text and CSV files)
`create_document_list()` collects the files of a corpus folder and normalizes their path names so that the module can process them uniformly. An optional second argument `ext` specifies the file extension: the first call below omits it and lists the plain text files, the second passes `'csv'`.


In [4]:
doclist_txt = collection.create_document_list(path_txt)
doclist_txt[:5]

29-Nov-2016 12:02:13 INFO collection: Creating document list from TXT files ...
29-Nov-2016 12:02:13 DEBUG collection: 17 entries in document list.


['corpus_txt/Doyle_AScandalinBohemia.txt',
 'corpus_txt/Doyle_AStudyinScarlet.txt',
 'corpus_txt/Doyle_TheHoundoftheBaskervilles.txt',
 'corpus_txt/Doyle_TheSignoftheFour.txt',
 'corpus_txt/Howard_GodsoftheNorth.txt']

In [5]:
doclist_csv = collection.create_document_list(path_csv, 'csv')
doclist_csv[:5]

29-Nov-2016 12:02:15 INFO collection: Creating document list from CSV files ...
29-Nov-2016 12:02:15 DEBUG collection: 16 entries in document list.


['corpus_csv/Doyle_AStudyinScarlet.txt.csv',
 'corpus_csv/Doyle_TheHoundoftheBaskervilles.txt.csv',
 'corpus_csv/Doyle_TheSignoftheFour.txt.csv',
 'corpus_csv/Howard_GodsoftheNorth.txt.csv',
 'corpus_csv/Howard_SchadowsinZamboula.txt.csv']
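
The tutorial does not show how `create_document_list()` works internally. A minimal sketch of the general idea, assuming it amounts to collecting all files with a given extension in sorted order. `list_documents` is a hypothetical helper, not part of the `collection` module, and the file names are made up:

```python
import glob
import os
import tempfile


def list_documents(folder, ext="txt"):
    """Return a sorted list of all files in `folder` ending in `.<ext>`."""
    pattern = os.path.join(folder, "*." + ext)
    return sorted(glob.glob(pattern))


# Demo on a throwaway directory with made-up file names.
demo_dir = tempfile.mkdtemp()
for name in ("b.txt", "a.txt", "a.txt.csv"):
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        handle.write("demo")

txt_files = [os.path.basename(p) for p in list_documents(demo_dir)]
csv_files = [os.path.basename(p) for p in list_documents(demo_dir, "csv")]
print(txt_files)
print(csv_files)
```

Sorting the result keeps the document order stable between runs, which matters when labels and documents are later matched by position.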

## 3. Getting document labels

In [6]:
doc_labels = collection.get_labels(doclist_txt)
list(doc_labels)[:5]

29-Nov-2016 12:02:16 INFO collection: Creating document labels ...
29-Nov-2016 12:02:16 DEBUG collection: Document labels available.


['Doyle_AScandalinBohemia',
 'Doyle_AStudyinScarlet',
 'Doyle_TheHoundoftheBaskervilles',
 'Doyle_TheSignoftheFour',
 'Howard_GodsoftheNorth']
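
Judging from the output above, a label is the file name without its folder and extension. A rough sketch of that mapping, assuming everything after the first dot is dropped. `labels_from_paths` is a hypothetical name, and the real `get_labels()` may handle edge cases differently:

```python
import os


def labels_from_paths(paths):
    """Yield the file name of each path, cut off at its first dot."""
    for path in paths:
        yield os.path.basename(path).split(".")[0]


paths = ["corpus_txt/Doyle_AScandalinBohemia.txt",
         "corpus_csv/Doyle_AStudyinScarlet.txt.csv"]
labels = list(labels_from_paths(paths))
print(labels)
```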

## 4. Loading the corpora and an optional external stopword list
The `read_from_txt()` and `read_from_csv()` functions each return a generator object, which is a memory-efficient way to handle larger corpora.

In [7]:
corpus_txt = collection.read_from_txt(doclist_txt)

In [8]:
corpus_csv = collection.read_from_csv(doclist_csv)

In [9]:
external = collection.read_from_txt("helpful_stuff/stopwords/en")
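
To illustrate what working with a generator means in practice, here is a small sketch of lazy file reading. `read_texts` is a hypothetical stand-in and not the module's own reader; the demo files are created on the fly:

```python
import os
import tempfile


def read_texts(paths):
    """Yield the content of each file, one document at a time."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            yield handle.read()


# Create two small demo files.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for name, text in (("one.txt", "first document"), ("two.txt", "second document")):
    path = os.path.join(demo_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    demo_paths.append(path)

texts = read_texts(demo_paths)   # nothing has been read yet
first = next(texts)              # reads only the first file
remaining = list(texts)          # consumes the rest
exhausted = list(texts)          # a generator can be iterated only once
```

Only one document is held in memory at a time. The trade-off is that a generator is single-pass: once consumed, it has to be created again before the corpus can be read a second time.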

## 5. Segmenting text
An important part of preprocessing in topic modeling is segmenting the texts into 'chunks'. The function takes the target corpus and the size of the 'chunks' as arguments; note that the log output below reports the segment length in characters. Depending on the language and type of text, the quality of the results can vary widely.
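
As a rough illustration of fixed-size segmentation, the following sketch slices a text into consecutive chunks of a given number of characters. This is an assumption about the general technique only: `segment_text` is a hypothetical helper, and the module's own `segmenter()` may respect word or paragraph boundaries.

```python
def segment_text(text, size):
    """Yield consecutive slices of `text`, each at most `size` characters long."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


sample = "abcdefghij"
chunks = list(segment_text(sample, 4))
print(chunks)
```

The last chunk is shorter than the others whenever the text length is not a multiple of the chunk size.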

In [10]:
segments = collection.segmenter(corpus_txt, 1000)
next(segments)

29-Nov-2016 12:02:21 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:21 INFO collection: Segmenting document ...
29-Nov-2016 12:02:21 DEBUG collection: Segment has a length of 1000 characters.


"A SCANDAL IN BOHEMIA\n\nA. CONAN DOYLE\n\n\nI\n\nTo Sherlock Holmes she is always _the_ woman. I have seldom heard him\nmention her under any other name. In his eyes she eclipses and\npredominates the whole of her sex. It was not that he felt any emotion\nakin to love for Irene Adler. All emotions, and that one particularly,\nwere abhorrent to his cold, precise but admirably balanced mind. He was,\nI take it, the most perfect reasoning and observing machine that the\nworld has seen; but as a lover, he would have placed himself in a false\nposition. He never spoke of the softer passions, save with a gibe and a\nsneer. They were admirable things for the observer--excellent for\ndrawing the veil from men's motives and actions. But for the trained\nreasoner to admit such intrusions into his own delicate and finely\nadjusted temperament was to introduce a distracting factor which might\nthrow a doubt upon all his mental results. Grit in a sensitive\ninstrument, or a crack in one of his own

## 6. Filtering text using POS tags
Another way to preprocess the text is to filter it by POS tags and work with lemmas instead of tokens (in this case, only adjectives, verbs and nouns are selected). The annotated CSV file provided in this example is already enriched with this information.


In [11]:
lemmas = collection.filter_POS_tags(corpus_csv)
next(lemmas)[:5]

29-Nov-2016 12:02:23 INFO collection: Accessing CSV documents ...
29-Nov-2016 12:02:23 INFO collection: Accessing ['ADJ', 'V', 'NN'] lemmas ...


37    typographical
56          textual
59           square
72              old
75             such
Name: Lemma, dtype: object
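
The filtering step can be sketched with the standard library alone. The sample below is made up, and the tab-separated layout with the column names `Token`, `Lemma` and `CPOS` is an assumption for illustration; check the format description linked in section 1 for the actual columns. `filter_lemmas` is a hypothetical helper, not the module's `filter_POS_tags()`:

```python
import csv
import io

# Made-up, tab-separated sample; column names are assumptions.
SAMPLE = (
    "Token\tLemma\tCPOS\n"
    "The\tthe\tART\n"
    "old\told\tADJ\n"
    "houses\thouse\tNN\n"
    "were\tbe\tV\n"
    "quickly\tquickly\tADV\n"
)


def filter_lemmas(handle, tags=("ADJ", "V", "NN")):
    """Return the lemmas of all rows whose coarse POS tag is in `tags`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Lemma"] for row in reader if row["CPOS"] in tags]


lemmas = filter_lemmas(io.StringIO(SAMPLE))
print(lemmas)
```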

## 7. Count terms to use `find_stopwords()` and `find_hapax()`

In [12]:
corpus = collection.calculate_term_frequency(corpus_txt)

29-Nov-2016 12:02:25 INFO collection: Calculating term frequency ...
29-Nov-2016 12:02:25 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:25 DEBUG collection: Term frequency calculated.
29-Nov-2016 12:02:25 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:25 DEBUG collection: Term frequency calculated.
29-Nov-2016 12:02:25 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:25 DEBUG collection: Term frequency calculated.
29-Nov-2016 12:02:25 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:25 DEBUG collection: Term frequency calculated.
29-Nov-2016 12:02:25 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:25 DEBUG collection: Term frequency calculated.
29-Nov-2016 12:02:25 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:25 DEBUG collection: Term frequency calculated.
29-Nov-2016 12:02:25 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:02:25 DEBUG collection: Term frequency calculated.
29-Nov-

In [13]:
clean_corpus = corpus.copy()
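
Counting terms across a corpus can be sketched as follows. The sketch uses `collections.Counter` and plain whitespace tokenization, both of which are assumptions; the module itself appears to return a pandas Series. `count_terms` is a hypothetical name:

```python
from collections import Counter


def count_terms(documents):
    """Count whitespace-separated tokens over all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    return counts


docs = ["the dog and the cat", "the bird"]
frequencies = count_terms(docs)
print(frequencies.most_common(2))
```

Whitespace splitting leaves punctuation attached to tokens, so `"I` and `I` would be counted as different terms.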

## 8. Find stopwords

In [14]:
stopwords = collection.find_stopwords(corpus, 50)
stopwords[:5]

29-Nov-2016 12:02:30 INFO collection: Finding stopwords ...
29-Nov-2016 12:02:30 DEBUG collection: 50 stopwords found.


the    21357
of     11614
and    11040
to      8516
a       7652
dtype: int64
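
Treating the most frequent terms as stopwords, as the output above suggests, can be sketched like this. `most_frequent` is a hypothetical helper working on a `Counter` with made-up counts, not the module's `find_stopwords()`:

```python
from collections import Counter


def most_frequent(frequencies, n):
    """Return the `n` terms with the highest counts, most frequent first."""
    return [term for term, _ in frequencies.most_common(n)]


frequencies = Counter({"the": 5, "of": 3, "holmes": 2, "violin": 1})
top_two = most_frequent(frequencies, 2)
print(top_two)
```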

## 9. Find hapax legomena

In [15]:
hapax = collection.find_hapax(corpus)
hapax[:5]

29-Nov-2016 12:02:33 INFO collection: Find hapax legomena ...
29-Nov-2016 12:02:33 DEBUG collection: 26096 hapax legomena found.


"'--Un     1
"'About    1
"'After    1
"'All      1
"'An       1
dtype: int64
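
Hapax legomena are terms that occur exactly once in the corpus. A minimal sketch of that selection on a plain dictionary of made-up counts; `find_hapax_terms` is a hypothetical name:

```python
def find_hapax_terms(frequencies):
    """Return, in sorted order, all terms that occur exactly once."""
    return sorted(term for term, count in frequencies.items() if count == 1)


frequencies = {"the": 5, "of": 3, "violin": 1, "gasogene": 1}
hapax_terms = find_hapax_terms(frequencies)
print(hapax_terms)
```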

## 10. Remove features

In [16]:
clean_corpus = collection.remove_features(clean_corpus, stopwords)

29-Nov-2016 12:02:46 INFO collection: Removing features ...
29-Nov-2016 12:02:46 DEBUG collection: 50 features removed.


In [18]:
clean_corpus = collection.remove_features(clean_corpus, hapax)

29-Nov-2016 12:03:01 INFO collection: Removing features ...
29-Nov-2016 12:06:21 DEBUG collection: 26096 features removed.


In [19]:
clean_corpus = collection.remove_features(clean_corpus, external)

29-Nov-2016 12:06:21 INFO collection: Removing features ...
29-Nov-2016 12:06:21 DEBUG collection: Accessing TXT document ...
29-Nov-2016 12:06:22 DEBUG collection: 444 features removed.
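
Removing a list of features from term counts can be sketched as below. This is an illustration on a plain dictionary with made-up counts, not the module's implementation, and `drop_terms` is a hypothetical name:

```python
def drop_terms(frequencies, unwanted):
    """Return a copy of `frequencies` without the terms listed in `unwanted`."""
    unwanted = set(unwanted)  # set membership keeps each lookup cheap
    return {term: count for term, count in frequencies.items()
            if term not in unwanted}


frequencies = {"the": 5, "of": 3, "holmes": 2, "violin": 1}
cleaned = drop_terms(frequencies, ["the", "of"])
cleaned = drop_terms(cleaned, ["violin", "not_in_corpus"])
print(cleaned)
```

The function returns a new dictionary and ignores unknown terms, so it can be applied repeatedly, once per feature list, as in the cells above.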


In [20]:
print("Corpus length: ", len(corpus))
print("Corpus length (without removed features): ", len(clean_corpus))
print("5 MFWs:\n", corpus.sort_values(ascending=False).head(5), "\n")
print("5 MFWs (without removed features):\n", clean_corpus.sort_values(ascending=False).head(5))

Corpus length:  43989
Corpus length (without removed features):  17399
5 MFWs:
 the    21357
of     11614
and    11040
to      8516
a       7652
dtype: int64 

5 MFWs (without removed features):
 will    841
It      789
We      601
man     511
"I      496
dtype: int64


## 11. Matrix Market

In [21]:
matrix_market = collection.create_matrix_market(corpus, doc_labels)

StopIteration: 
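
The call above ends in a `StopIteration`, so this tutorial shows no output for the step. For orientation, here is a sketch of the Matrix Market coordinate format itself: a header line, a size line (rows, columns, non-zero entries), and one line per non-zero entry with 1-based indices. `write_matrix_market` is a hypothetical helper operating on made-up counts and says nothing about how `create_matrix_market()` is implemented:

```python
import io


def write_matrix_market(rows, n_cols, handle):
    """Write a sparse document-term matrix in Matrix Market coordinate format.

    `rows` is a list with one dict per document, mapping a zero-based
    column (term) index to its count.
    """
    entries = [(i + 1, j + 1, float(value))       # the format uses 1-based indices
               for i, row in enumerate(rows)
               for j, value in sorted(row.items())
               if value != 0]
    handle.write("%%MatrixMarket matrix coordinate real general\n")
    handle.write("{} {} {}\n".format(len(rows), n_cols, len(entries)))
    for i, j, value in entries:
        handle.write("{} {} {}\n".format(i, j, value))


buffer = io.StringIO()
write_matrix_market([{0: 2, 2: 1}, {1: 3}], 3, buffer)
mm_text = buffer.getvalue()
print(mm_text)
```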

## 12. Visualization
Simple get-functions are implemented for visualization tasks. In this case, the `get_labels` function extracts the titles of the corpus files we loaded above.

In [22]:
lda_model = 'out_easy/corpus.lda'
corpus = 'out_easy/corpus.mm'
dictionary = 'out_easy/corpus.dict'
doc_labels = 'out_easy/corpus_doclabels.txt'
interactive = False

vis = collection.Visualization(lda_model, corpus, dictionary, doc_labels, interactive)

29-Nov-2016 12:06:37 INFO collection: Accessing corpus ...
29-Nov-2016 12:06:37 INFO gensim.corpora.indexedcorpus: loaded corpus index from out_easy/corpus.mm.index
29-Nov-2016 12:06:37 INFO gensim.matutils: initializing corpus reader from out_easy/corpus.mm
29-Nov-2016 12:06:37 INFO gensim.matutils: accepted corpus with 17 documents, 514 features, 4585 non-zero entries
29-Nov-2016 12:06:37 DEBUG collection: Corpus available.
29-Nov-2016 12:06:37 INFO collection: Accessing model ...
29-Nov-2016 12:06:37 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda
29-Nov-2016 12:06:37 INFO gensim.utils: loading id2word recursively from out_easy/corpus.lda.id2word.* with mmap=None
29-Nov-2016 12:06:37 INFO gensim.utils: setting ignored attribute state to None
29-Nov-2016 12:06:37 INFO gensim.utils: setting ignored attribute dispatcher to None
29-Nov-2016 12:06:37 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda.state
29-Nov-2016 12:06:37 DEBUG collection: Model 

In [23]:
vis.make_heatmap()

29-Nov-2016 12:06:39 INFO collection: Accessing topic distribution and topic probability ...
29-Nov-2016 12:06:39 DEBUG collection: Topic distribution and topic probability available.
29-Nov-2016 12:06:39 INFO collection: Accessing plot labels ...
29-Nov-2016 12:06:39 DEBUG collection: 10 plot labels available.
29-Nov-2016 12:06:39 INFO collection: Creating heatmap figure ...


For testing only...
corpus.mm:
[(0, 2.0), (1, 2.0), (2, 3.0), (3, 2.0), (4, 8.0), (5, 2.0), (6, 3.0), (7, 2.0), (8, 4.0), (9, 5.0), (10, 14.0), (11, 4.0), (12, 4.0), (13, 4.0), (14, 2.0), (15, 2.0), (16, 11.0), (17, 34.0), (18, 3.0), (19, 3.0), (20, 4.0), (21, 2.0), (22, 2.0), (23, 2.0), (24, 6.0), (25, 2.0), (26, 3.0), (27, 3.0), (28, 3.0), (29, 7.0), (30, 2.0), (31, 2.0), (32, 2.0), (33, 7.0), (34, 2.0), (35, 2.0), (36, 11.0), (37, 3.0), (38, 2.0), (39, 3.0), (40, 2.0), (41, 2.0), (42, 3.0), (43, 2.0), (44, 2.0), (45, 2.0), (46, 8.0), (47, 2.0), (48, 3.0), (49, 2.0), (50, 2.0), (51, 2.0), (52, 4.0), (53, 4.0), (54, 3.0), (55, 2.0), (56, 2.0), (57, 2.0), (58, 3.0), (59, 3.0), (60, 3.0), (61, 2.0), (62, 3.0), (63, 9.0), (64, 3.0), (65, 3.0), (66, 3.0), (67, 5.0), (68, 3.0), (69, 11.0), (70, 4.0), (71, 2.0), (72, 6.0), (73, 2.0), (74, 2.0), (75, 2.0), (76, 3.0), (77, 6.0), (78, 3.0), (79, 3.0), (80, 2.0), (81, 3.0), (82, 2.0), (83, 4.0), (84, 2.0), (85, 4.0), (86, 8.0), (87, 8.0), (88, 

29-Nov-2016 12:06:39 DEBUG collection: Heatmap figure available.


In [24]:
vis.save_heatmap("./visualizations/heatmap")

29-Nov-2016 12:06:40 INFO collection: Saving heatmap figure...
29-Nov-2016 12:06:40 DEBUG collection: Heatmap figure available at ./visualizations/heatmap/heatmap.png


In [25]:
vis = collection.Visualization(lda_model, corpus, dictionary, doc_labels, interactive=True)

29-Nov-2016 12:06:41 INFO collection: Accessing corpus ...
29-Nov-2016 12:06:41 INFO gensim.corpora.indexedcorpus: loaded corpus index from out_easy/corpus.mm.index
29-Nov-2016 12:06:41 INFO gensim.matutils: initializing corpus reader from out_easy/corpus.mm
29-Nov-2016 12:06:41 INFO gensim.matutils: accepted corpus with 17 documents, 514 features, 4585 non-zero entries
29-Nov-2016 12:06:41 DEBUG collection: Corpus available.
29-Nov-2016 12:06:41 INFO collection: Accessing model ...
29-Nov-2016 12:06:41 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda
29-Nov-2016 12:06:41 INFO gensim.utils: loading id2word recursively from out_easy/corpus.lda.id2word.* with mmap=None
29-Nov-2016 12:06:41 INFO gensim.utils: setting ignored attribute state to None
29-Nov-2016 12:06:41 INFO gensim.utils: setting ignored attribute dispatcher to None
29-Nov-2016 12:06:41 INFO gensim.utils: loading LdaModel object from out_easy/corpus.lda.state
29-Nov-2016 12:06:41 DEBUG collection: Model 

In [26]:
vis.make_interactive()

29-Nov-2016 12:06:44 INFO collection: Accessing model, corpus and dictionary ...
29-Nov-2016 12:06:44 DEBUG gensim.models.ldamodel: performing inference on a chunk of 17 documents
29-Nov-2016 12:06:44 DEBUG gensim.models.ldamodel: 6/17 documents converged within 50 iterations
29-Nov-2016 12:06:45 DEBUG collection: Interactive visualization available.


In [27]:
vis.save_interactive("./visualizations/interactive")

29-Nov-2016 12:06:45 INFO collection: Saving interactive visualization ...
29-Nov-2016 12:06:45 DEBUG collection: Interactive visualization available at ./visualizations/interactive/corpus_interactive.html and ./visualizations/interactive/corpus_interactive.json


![success](http://cdn2.hubspot.net/hub/128506/file-446943132-jpg/images/computer_woman_success.jpg)