# Testing `collection.py`
The following tutorial shows how to use the `collection` module of DARIAH-Topics.

## 1. Preparation
First, import the `collection` module so that your IPython notebook has access to its functions and classes. Then set the paths to two test corpora: one consisting of plain text files, and one consisting of annotated text in CSV format that was preprocessed with several NLP tools (for details on the format, see the [DARIAH-DKPro-Wrapper tutorial](https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper/blob/master/doc/tutorial.adoc)).

In [2]:
import collection

In [3]:
path_txt = "corpus_txt"
path_csv = "corpus_csv"

## 2. Creating a list of filenames (plain text and CSV files)
`create_document_list()` collects the files of a corpus folder and normalizes their path names so that the module can process them uniformly. The optional argument `ext` specifies the file extension: the first call below omits it and lists the TXT files, the second passes `'csv'` and lists the annotated CSV files.


In [4]:
doclist_txt = collection.create_document_list(path_txt)
doclist_txt[:5]

24-Nov-2016 15:35:19 INFO collection: Creating document list from TXT files ...
24-Nov-2016 15:35:19 DEBUG collection: 17 entries in document list.


['corpus_txt/Doyle_AScandalinBohemia.txt',
 'corpus_txt/Doyle_AStudyinScarlet.txt',
 'corpus_txt/Doyle_TheHoundoftheBaskervilles.txt',
 'corpus_txt/Doyle_TheSignoftheFour.txt',
 'corpus_txt/Howard_GodsoftheNorth.txt']

In [5]:
doclist_csv = collection.create_document_list(path_csv, 'csv')
doclist_csv[:5]

24-Nov-2016 15:35:19 INFO collection: Creating document list from CSV files ...
24-Nov-2016 15:35:19 DEBUG collection: 16 entries in document list.


['corpus_csv/Doyle_AStudyinScarlet.txt.csv',
 'corpus_csv/Doyle_TheHoundoftheBaskervilles.txt.csv',
 'corpus_csv/Doyle_TheSignoftheFour.txt.csv',
 'corpus_csv/Howard_GodsoftheNorth.txt.csv',
 'corpus_csv/Howard_SchadowsinZamboula.txt.csv']
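
To make the idea of an extension-filtered file list more concrete, here is a minimal sketch built on the standard library. It is not the implementation of `create_document_list()`; the helper name `list_documents` and the use of `glob` are assumptions made for illustration.

```python
import glob
import os
import tempfile


def list_documents(folder, ext="txt"):
    """Return a sorted list of all files in `folder` that end in `.ext`."""
    pattern = os.path.join(folder, "*." + ext)
    # Sorting keeps the order stable across platforms.
    return sorted(glob.glob(pattern))


# Build a throwaway folder with two TXT files and one CSV file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.txt", "a.txt", "a.txt.csv"):
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write("dummy")
    txt_files = [os.path.basename(p) for p in list_documents(folder)]
    csv_files = [os.path.basename(p) for p in list_documents(folder, "csv")]

print(txt_files)
print(csv_files)
```

Note that a file such as `a.txt.csv` only shows up in the CSV list, which mirrors the `.txt.csv` names in the output above.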

## 3. Getting document labels

In [6]:
doc_labels = collection.get_labels(doclist_txt)
list(doc_labels)[:5]

24-Nov-2016 15:35:20 INFO collection: Creating document labels ...
24-Nov-2016 15:35:20 DEBUG collection: Document labels available.


['Doyle_AScandalinBohemia.txt',
 'Doyle_AStudyinScarlet.txt',
 'Doyle_TheHoundoftheBaskervilles.txt',
 'Doyle_TheSignoftheFour.txt',
 'Howard_GodsoftheNorth.txt']
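
The labels shown above are the file names without their folder. A minimal sketch of that step, assuming nothing about how `get_labels()` works internally (the helper `iter_labels` is hypothetical), could look like this:

```python
import os


def iter_labels(paths):
    """Yield the bare file name of every path, one at a time."""
    for path in paths:
        yield os.path.basename(path)


paths = ["corpus_txt/Doyle_AScandalinBohemia.txt",
         "corpus_txt/Howard_GodsoftheNorth.txt"]
labels = list(iter_labels(paths))
print(labels)
```

Because the sketch is a generator, it has to be wrapped in `list()` before it can be sliced, just like `doc_labels` in the cell above.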

## 4. Loading the corpora and an optional external stopword list
The functions `read_from_txt()` and `read_from_csv()` each create a generator object, which provides a memory-efficient way to handle larger corpora: documents are read one at a time instead of all at once.

In [7]:
corpus_txt = collection.read_from_txt(doclist_txt)

In [8]:
corpus_csv = collection.read_from_csv(doclist_csv)

In [9]:
external = collection.read_from_txt("helpful_stuff/stopwords/en")
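
The following sketch illustrates why a generator saves memory: a file is only opened when the next item is requested. The function `read_lazily` is a hypothetical stand-in and not the module's `read_from_txt()`.

```python
import os
import tempfile


def read_lazily(paths):
    """Yield the content of each file only when the caller asks for it."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            yield f.read()


with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, text in (("one.txt", "first text"), ("two.txt", "second text")):
        path = os.path.join(folder, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)

    documents = read_lazily(paths)  # nothing has been read yet
    first = next(documents)         # reads only one.txt
    rest = list(documents)          # reads the remaining file
```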

## 5. Segmenting text
An important part of preprocessing in topic modeling is segmenting the texts into 'chunks'. The function takes the corpus to segment and the size of each chunk; note that the log output below reports this size in characters. Depending on the language and the type of text, the quality of the results can vary widely.

In [10]:
segments = collection.segmenter(corpus_txt, 1000)
next(segments)

24-Nov-2016 15:35:22 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:22 INFO collection: Segmenting document ...
24-Nov-2016 15:35:22 DEBUG collection: Segment has a length of 1000 characters.


"A SCANDAL IN BOHEMIA\n\nA. CONAN DOYLE\n\n\nI\n\nTo Sherlock Holmes she is always _the_ woman. I have seldom heard him\nmention her under any other name. In his eyes she eclipses and\npredominates the whole of her sex. It was not that he felt any emotion\nakin to love for Irene Adler. All emotions, and that one particularly,\nwere abhorrent to his cold, precise but admirably balanced mind. He was,\nI take it, the most perfect reasoning and observing machine that the\nworld has seen; but as a lover, he would have placed himself in a false\nposition. He never spoke of the softer passions, save with a gibe and a\nsneer. They were admirable things for the observer--excellent for\ndrawing the veil from men's motives and actions. But for the trained\nreasoner to admit such intrusions into his own delicate and finely\nadjusted temperament was to introduce a distracting factor which might\nthrow a doubt upon all his mental results. Grit in a sensitive\ninstrument, or a crack in one of his own

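As a rough illustration of fixed-size segmentation, the sketch below cuts a string into consecutive slices of at most `size` characters. It assumes character-based chunks, as the log output suggests, and is not the code behind `collection.segmenter()`; the name `segment` is hypothetical.

```python
def segment(text, size):
    """Yield consecutive slices of `text`, each at most `size` characters long."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


sample = "To Sherlock Holmes she is always the woman."
chunks = list(segment(sample, 10))
print(chunks)
```

Slicing by position is simple but cuts words in half at chunk borders, which is one reason the quality of the segments can differ between text types.
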
## 6. Filtering text using POS-Tags
Another way to preprocess the text is to filter it by POS tags and to use lemmas instead of word forms. In this case the filter selects adjectives, verbs and nouns (`['ADJ', 'V', 'NN']` in the log output below). The annotated CSV files provided for this example already contain this information.


In [11]:
lemmas = collection.filter_POS_tags(corpus_csv)
next(lemmas)[:5]

24-Nov-2016 15:35:23 INFO collection: Accessing CSV documents ...
24-Nov-2016 15:35:23 INFO collection: Accessing ['ADJ', 'V', 'NN'] lemmas ...


37    typographical
56          textual
59           square
72              old
75             such
Name: Lemma, dtype: object
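
The filtering step can be sketched with the standard `csv` module. The `Lemma` column appears in the output above; the column name `CPOS`, the tab delimiter and the helper `filter_lemmas` are assumptions for this example, not a description of `filter_POS_tags()`.

```python
import csv
import io

# Tiny stand-in for an annotated file. The column name "CPOS" and the
# tab delimiter are assumptions made for this sketch.
annotated = io.StringIO(
    "Token\tLemma\tCPOS\n"
    "The\tthe\tART\n"
    "old\told\tADJ\n"
    "houses\thouse\tNN\n"
    "were\tbe\tV\n"
    "quickly\tquickly\tADV\n"
)


def filter_lemmas(handle, keep=("ADJ", "V", "NN")):
    """Return the lemma of every row whose coarse POS tag is in `keep`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Lemma"] for row in reader if row["CPOS"] in keep]


lemmas = filter_lemmas(annotated)
print(lemmas)
```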

## 7. Count terms to use `find_stopwords()` and `find_hapax()`

In [12]:
corpus = collection.calculate_term_frequency(corpus_txt)

24-Nov-2016 15:35:24 INFO collection: Calculating term frequency ...
24-Nov-2016 15:35:24 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:24 DEBUG collection: Term frequency calculated.
24-Nov-2016 15:35:24 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:24 DEBUG collection: Term frequency calculated.
24-Nov-2016 15:35:24 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:24 DEBUG collection: Term frequency calculated.
24-Nov-2016 15:35:24 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:24 DEBUG collection: Term frequency calculated.
24-Nov-2016 15:35:24 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:24 DEBUG collection: Term frequency calculated.
24-Nov-2016 15:35:24 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:24 DEBUG collection: Term frequency calculated.
24-Nov-2016 15:35:24 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:24 DEBUG collection: Term frequency calculated.
24-Nov-

In [13]:
clean_corpus = corpus.copy()
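
Counting terms over a whole corpus can be sketched with `collections.Counter`. Splitting on whitespace is an assumption here (it would explain tokens such as `"I` further down, but the module may tokenize differently), and `term_frequency` is a hypothetical helper.

```python
from collections import Counter


def term_frequency(documents):
    """Count how often each whitespace-separated token occurs in all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    return counts


documents = ["the dog saw the cat", "the cat ran"]
frequencies = term_frequency(documents)
print(frequencies.most_common(2))
```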

## 8. Find stopwords

In [14]:
stopwords = collection.find_stopwords(corpus, 50)
stopwords[:5]

24-Nov-2016 15:35:27 INFO collection: Finding stopwords ...
24-Nov-2016 15:35:27 DEBUG collection: 50 stopwords found.


the    21357
of     11614
and    11040
to      8516
a       7652
dtype: int64
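
The output above suggests that the stopword candidates are simply the most frequent words. A minimal sketch of that idea, with made-up counts and a hypothetical helper `most_frequent`:

```python
from collections import Counter

# Made-up counts for illustration only.
frequencies = Counter({"the": 7, "of": 5, "and": 4, "Holmes": 2, "veil": 1})


def most_frequent(counts, n):
    """Return the `n` terms with the highest counts, most frequent first."""
    return [term for term, _ in counts.most_common(n)]


candidates = most_frequent(frequencies, 3)
print(candidates)
```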

## 9. Find hapax legomena

In [15]:
hapax = collection.find_hapax(corpus)
hapax = hapax[:50]
hapax[:5]

24-Nov-2016 15:35:28 INFO collection: Find hapax legomena ...
24-Nov-2016 15:35:28 DEBUG collection: 26096 hapax legomena found.


"'--Un     1
"'About    1
"'After    1
"'All      1
"'An       1
dtype: int64
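
Hapax legomena are terms that occur exactly once in the corpus. The sketch below shows that selection on made-up counts; `hapax_legomena` is a hypothetical helper, not the module's `find_hapax()`.

```python
def hapax_legomena(counts):
    """Return, in sorted order, every term that occurs exactly once."""
    return sorted(term for term, count in counts.items() if count == 1)


# Made-up counts for illustration only.
counts = {"the": 7, "veil": 1, "Holmes": 2, "gibe": 1}
hapax = hapax_legomena(counts)
print(hapax)
```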

## 10. Remove features

In [16]:
clean_corpus = collection.remove_features(clean_corpus, stopwords)

24-Nov-2016 15:35:29 INFO collection: Removing features ...
24-Nov-2016 15:35:29 DEBUG collection: 50 features removed.


In [17]:
# Why does it take so long to remove the hapax legomena completely?
# In the 100 previous runs it took < 1 second, and the content is
# exactly the same. Was DataFrame faster than Series?

clean_corpus = collection.remove_features(clean_corpus, hapax)

24-Nov-2016 15:35:30 INFO collection: Removing features ...
24-Nov-2016 15:35:31 DEBUG collection: 50 features removed.


In [18]:
clean_corpus = collection.remove_features(clean_corpus, external)

24-Nov-2016 15:35:34 INFO collection: Removing features ...
24-Nov-2016 15:35:34 DEBUG collection: Accessing TXT document ...
24-Nov-2016 15:35:38 DEBUG collection: 476 features removed.


In [19]:
print("Length of the corpus: ", len(corpus))
print("Length of the corpus (without features): ", len(clean_corpus))
print("5 MFWs:\n", corpus.sort_values(ascending=False).head(5), "\n")
print("5 MFWs (without features):\n", clean_corpus.sort_values(ascending=False).head(5))

Length of the corpus:  43989
Length of the corpus (without features):  43413
5 MFWs:
 the    21357
of     11614
and    11040
to      8516
a       7652
dtype: int64 

5 MFWs (without features):
 will    841
It      789
We      601
man     511
"I      496
dtype: int64
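
Removing features means dropping a given set of terms from the frequency table. A minimal sketch on a plain dictionary, with the hypothetical helper `drop_terms`, looks like this:

```python
def drop_terms(counts, unwanted):
    """Return a copy of `counts` without the terms listed in `unwanted`."""
    unwanted = set(unwanted)  # set membership tests are fast
    return {term: n for term, n in counts.items() if term not in unwanted}


# Made-up counts for illustration only.
counts = {"the": 7, "of": 5, "veil": 1, "man": 3}
cleaned = drop_terms(counts, ["the", "of", "absent"])
print(cleaned)
```

Terms that are not in the table, such as `"absent"` here, are simply ignored.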


## 11. Matrix Market

In [20]:
matrix_market = collection.create_matrix_market(clean_corpus, doc_labels)

StopIteration: 
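
The notebook does not explain this `StopIteration`. One possible cause, which is an assumption and not confirmed by the source, is that a generator passed to the function had already been consumed by an earlier cell. The sketch below shows the general behaviour with a plain Python generator:

```python
labels = (name for name in ["a.txt", "b.txt"])

first_pass = list(labels)   # consumes every item
second_pass = list(labels)  # the generator is now empty

try:
    next(labels)
    exhausted = False
except StopIteration:
    # Asking an exhausted generator for another item raises StopIteration.
    exhausted = True

print(first_pass, second_pass, exhausted)
```

If this is the cause, recreating the generator right before the call, or storing its items in a list, would avoid the error.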

## 12. Visualization
The module provides simple getter functions for visualization tasks. For example, `get_labels()` extracts the titles of the corpus files we loaded above.

In [None]:
lda_model = 'out_easy/corpus.lda'
corpus = 'out_easy/corpus.mm'
dictionary = 'out_easy/corpus.dict'
doc_labels = 'out_easy/corpus_doclabels.txt'
interactive = False

vis = collection.Visualization(lda_model, corpus, dictionary, doc_labels, interactive)

It is not possible to run `save_heatmap()` before `make_heatmap()`.

In [None]:
vis.save_heatmap("./visualizations/heatmap")

In [None]:
vis.make_heatmap()

In [None]:
vis.save_heatmap("./visualizations/heatmap")

In [None]:
vis = collection.Visualization(lda_model, corpus, dictionary, doc_labels, interactive=True)

It is not possible to run `save_interactive()` before `make_interactive()`.

In [None]:
vis.save_interactive("./visualizations/interactive")

In [None]:
vis.make_interactive()

In [None]:
vis.save_interactive("./visualizations/interactive")
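
The make-before-save rule can be illustrated with a small stand-in class. `Heatmap`, `make()` and `save()` are hypothetical names for this sketch; they do not describe how `collection.Visualization` is implemented or which error it raises.

```python
class Heatmap:
    """Hypothetical stand-in, not the module's `Visualization` class."""

    def __init__(self):
        self.figure = None

    def make(self):
        self.figure = "heatmap figure"  # placeholder for a real plot object
        return self.figure

    def save(self, path):
        if self.figure is None:
            raise RuntimeError("call make() before save()")
        # A real implementation would write a file; this sketch only reports.
        return "saved {} to {}".format(self.figure, path)


vis = Heatmap()
try:
    vis.save("heatmap")
    failed_early = False
except RuntimeError:
    failed_early = True

vis.make()
message = vis.save("heatmap")
print(failed_early, message)
```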
