Text preprocessing module

Note that only French and English are supported on our end. Support for all languages available in Spacy will be added at a later time.

Installation


  • Just run pip3 install . from the root of the repository
  • You may need to install Spacy data files (language models) if you are going to use PoS tagging. See the Spacy documentation for instructions.

How to use

  • Import the PreProcessing class in your module and instantiate it as below (all keyword arguments show their default values):
from text_preprocessing import PreProcessing

preproc = PreProcessing(
        word_regex=r"\w+",          # regex for tokenizing if passing a string
        sentence_regex=r"[.!?]+",   # regex for sentence splitting if passing a string
        lemmatizer=None,            # path to a file with form/lemma pairs separated by a tab, or "spacy" to use the Spacy lemmatizer
        stopwords=None,             # path to a stopwords file
        ascii=False,                # convert tokens to their ASCII representation
        with_pos=False,             # return Token objects with PoS info
        pos_to_keep=[],             # keep only tokens matching this list of PoS tags (for available tags, see the Spacy docs)
        is_philo_db=False,          # if parsing from a words_and_philo_ids file generated by PhiloLogic
        text_object_type="doc",     # define the text object using the PhiloLogic text object model
        return_type="words",        # return a list of words or a list of sentences
        workers=None,               # number of workers
        post_processing_function=None,  # optional post-processing function applied before each text object is returned
        progress=True               # show progress info
)
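
As a rough illustration of what a post_processing_function might look like: a plain callable that receives a processed text object and returns a modified version. The function name and filtering logic below are hypothetical, not part of the library.

```python
# Hypothetical post-processing function: receives a text object (here
# treated as a list of token strings) and returns a modified version.
# Purely illustrative; the library imposes no specific logic.
def drop_short_tokens(text_object):
    """Remove tokens shorter than 3 characters from a text object."""
    return [token for token in text_object if len(token) >= 3]

# It would then be passed at instantiation time, e.g.:
# preproc = PreProcessing(post_processing_function=drop_short_tokens)
```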
  • To process a list of files, pass the list of paths to process_files(). It yields Tokens objects (which behave much like lists of strings, with some added features), each containing a list of Token objects:
for text_object in preproc.process_files(["file1", "file2"]):
        resulting_string = " ".join(text_object) # create a string containing every token separated by a space
        surface_forms = " ".join(token.surface_form for token in text_object) # create a string containing every surface form of a token separated by a space
        print(text_object.metadata) # print the text object metadata (a dictionary)
  • process_files takes an optional keep_all=True keyword that will store all filtered words in the surface_form attribute of Token objects (e.g. "token.surface_form")
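
The joining and metadata access shown above can be mimicked with a minimal stand-in. The Tokens class below is an illustrative mock of the described behavior (a list of strings carrying a metadata dictionary), not the library's actual implementation.

```python
# Illustrative mock of a Tokens object as described above: behaves like
# a list of strings and carries a metadata dictionary. Not the real class.
class Tokens(list):
    def __init__(self, tokens, metadata=None):
        super().__init__(tokens)
        self.metadata = metadata or {}

text_object = Tokens(["le", "chat", "dort"], metadata={"title": "Example"})
resulting_string = " ".join(text_object)  # tokens separated by a space
print(resulting_string)              # le chat dort
print(text_object.metadata["title"])  # Example
```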

  • Token objects have four attributes:

    • text: the final form of the token after processing
    • surface_form: the original form of the token before processing. If keep_all=True was passed to process_files, this also includes tokens that were filtered out.
    • pos_: part-of-speech info, if available
    • ext: additional metadata from PhiloLogic, such as start_byte and end_byte
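
As a rough sketch of these four attributes, here is an illustrative stand-in built with a dataclass; the library's actual Token class may differ (e.g. it may subclass str).

```python
from dataclasses import dataclass, field

# Illustrative stand-in for a Token with the four attributes described
# above. Field names match the README; everything else is hypothetical.
@dataclass
class Token:
    text: str                 # final form after processing
    surface_form: str         # original form before processing
    pos_: str = ""            # part-of-speech info, if available
    ext: dict = field(default_factory=dict)  # PhiloLogic metadata, e.g. byte offsets

token = Token(text="chat", surface_form="Chats", pos_="NOUN",
              ext={"start_byte": 120, "end_byte": 125})
print(token.text, token.pos_, token.ext["start_byte"])
```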

  • To print the contents of each token, iterate over every token in the returned text objects like so:

for text_object in preproc.process_files(["file1", "file2"]):
        for token in text_object:
                print(token) # shorthand for print(token.text)
                print(token.pos_) # print part of speech 
                print(token.ext["start_byte"], token.ext["end_byte"]) # print start and end byte of each token (if available)