Update preprocessing.py
severinsimmler committed Oct 7, 2017
1 parent 472b0d1 commit f623ddc
Showing 1 changed file with 45 additions and 22 deletions.
67 changes: 45 additions & 22 deletions dariah_topics/preprocessing.py
@@ -5,38 +5,61 @@
Processing Text Data, Creating Matrices and Cleaning Corpora
============================================================
Functions of this module are for **preprocessing purpose**. You can read text files,
tokenize and segment documents, create document-term matrices, determine and
remove features and read existing matrices. Recurrent variable names are based
on the following conventions:
Functions of this module are for **preprocessing purposes**. You can read text \
files, `tokenize <https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)>`_ \
and segment documents, create `document-term matrices <https://en.wikipedia.org/wiki/Document-term_matrix>`_, \
determine and remove features, and read existing matrices. Recurrent variable names are \
based on the following conventions:
1. Corpora:
***********
* ``corpus`` means an iterable containing at least one `document`.
* ``document`` means one single string containing all characters of a text
file, including whitespaces, punctuations, etc. In case of specific
CSV-files, there is no ``document``, but only ``dkpro_document``,
because each of those CSV-files contain extra information such as
POS-tags, lemmas, etc.
* ``corpus`` means an iterable containing at least one ``document`` or ``dkpro_document``.
* ``document`` means a single string containing all characters of a text \
file, including whitespace, punctuation, etc.
* ``dkpro_document`` means a pandas DataFrame containing tokens and additional \
information, e.g. *part-of-speech tags* or *lemmas*.
* ``tokenized_corpus`` means an iterable containing at least one ``tokenized_document``.
* ``tokenized_document`` means an iterable containing tokens of a `document`.
* ``clean_tokenized_corpus`` means an iterable containing at least
one ``clean_tokenized_document``.
* ``clean_tokenized_document`` means an iterable containing only specific
tokens of a ``tokenized_document`` (e.g. stopwords and hapax legomena are excluded).
* ``dkpro_document`` means a pandas DataFrame containing at
least tokens and POS-tags.
* ``document_labels`` means an iterable containing names of each ``document``
* ``tokenized_document`` means an iterable containing tokens of a ``document``.
* ``clean_tokenized_corpus`` means an iterable containing at least one ``clean_tokenized_document``.
* ``clean_tokenized_document`` means an iterable containing only specific \
tokens (e.g. no *stopwords* or *hapax legomena*) of a ``tokenized_document``.
* ``document_labels`` means an iterable containing the name of each ``document`` \
and must have as many elements as ``corpus``, ``tokenized_corpus`` or
``clean_tokenized_corpus``, respectively.
Furthermore, if a document is chunked into smaller segments, each segment counts
as one document.
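As a purely illustrative sketch (the values below are made up and not taken from the module itself), the conventions above translate into plain Python objects like this::

    document = "The quick brown fox. The lazy dog sleeps."
    corpus = [document]
    document_labels = ["fable"]

    tokenized_document = ["the", "quick", "brown", "fox", "the", "lazy", "dog", "sleeps"]
    tokenized_corpus = [tokenized_document]

    # Hypothetical cleaning step: stopwords ("the") removed from the
    # tokenized document.
    clean_tokenized_document = ["quick", "brown", "fox", "lazy", "dog", "sleeps"]
    clean_tokenized_corpus = [clean_tokenized_document]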
"""
__author__ = "DARIAH-DE"
__authors__ = "Steffen Pielstroem, Philip Duerholt, Sina Bock, Severin Simmler"
__email__ = "pielstroem@biozentrum.uni-wuerzburg.de"
2. Data models:
***************
* ``document_term_matrix`` means either a pandas DataFrame whose rows correspond to \
``document_labels`` and whose columns correspond to types (distinct tokens in the \
corpus), with token frequencies as values, or a pandas DataFrame with a MultiIndex \
and a single column containing token frequencies. The first level of the MultiIndex \
corresponds to a document ID (based on ``document_labels``) and the second level \
to a type ID.
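The two shapes described above can be sketched with pandas directly; the numbers below are invented and the exact construction used by the module may differ::

    import pandas as pd

    # Shape 1: rows correspond to document_labels, columns to types,
    # values to token frequencies.
    document_term_matrix = pd.DataFrame(
        [[2, 1, 0],
         [0, 1, 3]],
        index=["document_a", "document_b"],
        columns=["quick", "brown", "fox"],
    )

    # Shape 2: a MultiIndex of (document ID, type ID) with a single
    # column of token frequencies, suitable for large corpora.
    # The column name "frequency" is an assumption for illustration.
    index = pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 1), (1, 2)],
        names=["document_id", "type_id"],
    )
    document_term_matrix = pd.DataFrame({"frequency": [2, 1, 1, 3]}, index=index)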
Contents:
#########
* :func:`create_document_term_matrix()` creates a document-term matrix for either \
large or small corpora.
* :func:`duplicate_document_label()` duplicates a ``document_label`` with consecutive \
numbers.
* :func:`filter_dkpro_document()` filters a ``dkpro_document`` by specific \
*part-of-speech tags*.
* :func:`find_hapax_legomena()` determines *hapax legomena* based on frequencies \
of a ``document_term_matrix``.
* :func:`find_stopwords()` determines *most frequent words* based on frequencies \
of a ``document_term_matrix``.
* :func:`read_from_pathlist()` reads one or more files based on a list of paths.
* :func:`segment()` is a wrapper for :func:`segment_fuzzy()` and segments a \
``tokenized_document`` into segments of a certain number of tokens, respecting existing chunks.
* :func:`segment_fuzzy()` segments a ``tokenized_document``, tolerating existing \
chunks (like paragraphs).
* :func:`split_paragraphs()` splits a ``document`` by paragraphs.
* :func:`tokenize()` tokenizes a ``document`` based on a Unicode regular expression.
* :func:`remove_features()` removes features from a ``document_term_matrix``.
"""

from collections import Counter, defaultdict
import csv
