Update preprocessing.py
severinsimmler committed Oct 7, 2017
1 parent 472b0d1 commit f623ddc
Showing 1 changed file with 45 additions and 22 deletions.
67 changes: 45 additions & 22 deletions dariah_topics/preprocessing.py
@@ -5,38 +5,61 @@
Processing Text Data, Creating Matrices and Cleaning Corpora
============================================================
Functions of this module are for **preprocessing purpose**. You can read text files,
tokenize and segment documents, create document-term matrices, determine and
remove features and read existing matrices. Recurrent variable names are based
on the following conventions:
Functions of this module are for **preprocessing purposes**. You can read text \
files, `tokenize <https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)>`_ \
and segment documents, create `document-term matrices <https://en.wikipedia.org/wiki/Document-term_matrix>`_, \
determine and remove features, and read existing matrices. Recurrent variable names are \
based on the following conventions:
1. Corpora:
***********
* ``corpus`` means an iterable containing at least one `document`.
* ``document`` means one single string containing all characters of a text
file, including whitespaces, punctuations, etc. In case of specific
CSV-files, there is no ``document``, but only ``dkpro_document``,
because each of those CSV-files contain extra information such as
POS-tags, lemmas, etc.
* ``corpus`` means an iterable containing at least one ``document`` or ``dkpro_document``.
* ``document`` means a single string containing all characters of a text \
file, including whitespace, punctuation, etc.
* ``dkpro_document`` means a pandas DataFrame containing tokens and additional \
information, e.g. *part-of-speech tags* or *lemmas*.
* ``tokenized_corpus`` means an iterable containing at least one ``tokenized_document``.
* ``tokenized_document`` means an iterable containing tokens of a `document`.
* ``clean_tokenized_corpus`` means an iterable containing at least
one ``clean_tokenized_document``.
* ``clean_tokenized_document`` means an iterable containing only specific
tokens of a ``tokenized_document`` (e.g. stopwords and hapax legomena are excluded).
* ``dkpro_document`` means a pandas DataFrame containing at
least tokens and POS-tags.
* ``document_labels`` means an iterable containing names of each ``document``
* ``tokenized_document`` means an iterable containing tokens of a ``document``.
* ``clean_tokenized_corpus`` means an iterable containing at least one ``clean_tokenized_document``.
* ``clean_tokenized_document`` means an iterable containing only specific \
tokens (e.g. no *stopwords* or *hapax legomena*) of a ``tokenized_document``.
* ``document_labels`` means an iterable containing the name of each ``document`` \
and must have as many elements as ``corpus``, ``tokenized_corpus`` or
``clean_tokenized_corpus``, respectively.
Furthermore, if a document is chunked into smaller segments, each segment counts
as one document.
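As a purely illustrative sketch (the values below are made up and not taken from the module itself), the conventions above translate into plain Python objects like this::

    document = "The quick brown fox. The lazy dog sleeps."
    corpus = [document]
    document_labels = ["fable"]

    tokenized_document = ["the", "quick", "brown", "fox", "the", "lazy", "dog", "sleeps"]
    tokenized_corpus = [tokenized_document]

    # Hypothetical cleaning step: stopwords ("the") removed from the
    # tokenized document.
    clean_tokenized_document = ["quick", "brown", "fox", "lazy", "dog", "sleeps"]
    clean_tokenized_corpus = [clean_tokenized_document]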
"""
__author__ = "DARIAH-DE"
__authors__ = "Steffen Pielstroem, Philip Duerholt, Sina Bock, Severin Simmler"
__email__ = "pielstroem@biozentrum.uni-wuerzburg.de"
2. Data models:
***************
* ``document_term_matrix`` means either a pandas DataFrame whose rows correspond to \
``document_labels`` and whose columns correspond to types (distinct tokens in the \
corpus), with token frequencies as values, or a pandas DataFrame with a MultiIndex \
and a single column containing token frequencies. The first level of the MultiIndex \
corresponds to a document ID (based on ``document_labels``) and the second level \
to a type ID.
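The two shapes described above can be sketched with pandas directly; the numbers below are invented and the exact construction used by the module may differ::

    import pandas as pd

    # Shape 1: rows correspond to document_labels, columns to types,
    # values to token frequencies.
    document_term_matrix = pd.DataFrame(
        [[2, 1, 0],
         [0, 1, 3]],
        index=["document_a", "document_b"],
        columns=["quick", "brown", "fox"],
    )

    # Shape 2: a MultiIndex of (document ID, type ID) with a single
    # column of token frequencies, suitable for large corpora.
    # The column name "frequency" is an assumption for illustration.
    index = pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 1), (1, 2)],
        names=["document_id", "type_id"],
    )
    document_term_matrix = pd.DataFrame({"frequency": [2, 1, 1, 3]}, index=index)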
Contents:
#########
* :func:`create_document_term_matrix()` creates a document-term matrix for either \
large or small corpora.
* :func:`duplicate_document_label()` duplicates a ``document_label`` with consecutive \
numbers.
* :func:`filter_dkpro_document()` filters a ``dkpro_document`` by specific \
*part-of-speech tags*.
* :func:`find_hapax_legomena()` determines *hapax legomena* based on frequencies \
of a ``document_term_matrix``.
* :func:`find_stopwords()` determines *most frequent words* based on frequencies \
of a ``document_term_matrix``.
* :func:`read_from_pathlist()` reads one or more files based on a list of paths.
* :func:`segment()` is a wrapper for :func:`segment_fuzzy()` and segments a \
``tokenized_document`` into segments of a certain number of tokens, respecting existing chunks.
* :func:`segment_fuzzy()` segments a ``tokenized_document``, tolerating existing \
chunks (like paragraphs).
* :func:`split_paragraphs()` splits a ``document`` by paragraphs.
* :func:`tokenize()` tokenizes a ``document`` based on a Unicode regular expression.
* :func:`remove_features()` removes features from a ``document_term_matrix``.
"""

from collections import Counter, defaultdict
import csv
