-
Notifications
You must be signed in to change notification settings - Fork 0
License
JavaScriptorian/textelixir
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
# TextElixir TextElixir is a Python module that facilitates the data collection and analysis of textual corpora. While some programs like AntConc and WordCruncher are faster in speed, TextElixir provides the flexibility to gather data on limitless queries without needing to navigate through a graphical interface. ## Installation Install textelixir with pip ```bash pip install textelixir ``` ## Importing TextElixir ```bash from textelixir import TextElixir ``` ## Tagging a Corpus ```http elixir = TextElixir('my_corpus.txt', lang='en', tagger_option='spacy:efficient:pos') ``` | Parameter | Type | Description | | :-------- | :------- | :------------------------- | | `filename` | `string` or `glob` | **Required**. Accepts a path to a filename, which can be a TXT file, a TSV file, or a glob of multiple TXT files. | | `lang` | `string` | **Optional**. Accepts a language code. Defaults to `en` for English. For available languages, see [SpaCy](https://spacy.io/models) or [Stanza](https://stanfordnlp.github.io/stanza/available_models.html) for available tagging models. | | `tagger_option` | `string` | **Required**. Accepts one of these for options for tagging:<br>`spacy:efficient:pos`: Uses a fast SpaCy tagger and uses tags like VERB, NOUN<br>`spacy:accurate:xpos`: Uses a slow SpaCy tagger and uses tags like VBZ, NN1<br>`stanza:pos`: Uses a Stanza tagger and uses tags like VERB, NOUN<br>`stanza:xpos`: Uses a Stanza tagger and uses tags like VBZ, NN1| TextElixir will tag a TXT file, a glob of TXT files, or a TSV file. By running this line of code, it'll create a subfolder within the directory you provide and add all of the POS tags and lemmas to each word. By saving the tagged corpus into that subfolder, this line only needs to be run once. If you provide a TSV file, make sure that each column has headers and that the column with the text you want tagged has the column header of `text`. By doing this, you can use the other columns for search filtering and reference grouping. If you provide a glob of TXT files, the filename will still be associated to each word within the corpus, which helps with basic search filtering if the text filename has a semantic meaning. TextElixir has a couple options for different taggers to use. At the moment, you can have TextElixir tag your text with SpaCy or Stanza taggers. The modules for spacy and stanza are included in the installation of TextElixir, but the language models for each must be downloaded prior to tagging a text. ## Loading the Tagged Corpus ```http elixir = TextElixir('my_corpus', verbose=False, punct_pos=['PUNCT', 'SYM', 'SPACE']) ``` | Parameter | Type | Description | | :-------- | :------- | :------------------------- | | `filename` | `string` | **Required**. Accepts the name of the subdirectory created when tagging a corpus. | | `verbose` | `boolean` | **Optional**. If False, it turns off any messages printed out to the terminal. Defaults to True. | | `punct_pos` | `list` | **Optional**. Accepts a list of strings of the parts of speech that you want to consider `subwords`. These words will be skipped when doing searches and when calculating collocates, ngrams, etc. To search for punctuation, you'll need to set this argument to an empty list.| When loading a tagged corpus, the `lang` and `tagger_option` arguments are no longer needed. # Search Performing a search is one of the main methods that you can use on a **TextElixir**. The search method returns a **SearchResults** class, which contains the frequency of the search query and a list of indices for where your search query occurs in the corpus. The SearchResults class contains several other methods for calculating collocates, frequency distribution charts, sentences, and concordance lines. ## Searching for a Word ```bash results = elixir.search('engage') ``` ## Searching for a Lemma ```bash results = elixir.search('ENGAGE') ``` ## Searching for a Part of Speech ```bash results = elixir.search('/VERB/') ``` ## Searching for a Word with its Part of Speech ```bash results = elixir.search('leaves_NOUN') ``` ## Searching for a Lemma with its Part of Speech ```bash results = elixir.search('ADVOCATE_NOUN') ``` ## Searching for a Phrase ```bash results = elixir.search('/ADJ/ cat') ``` ## Searching with Wildcards ```bash # Find inform, informs, information, etc. results = elixir.search('inform*') # Find bat, cat, etc. results = elixir.search('?at') ``` ## Searching with Regular Expressions ```bash # Find 4-digit numbers results = elixir.search(r'\d{4}', regex=True) ``` ## Search for Words Separated by Distance ```bash # Finds the word 'supporter' 1-5 words away from the lemma 'cat' results = elixir.search('supporter ~5~ CAT') ``` # Filter the Corpus Prior to Search Filters can be applied to a corpus prior to searching. This allows you to get data on specific sections of your corpus rather than the entire corpus. ## Positive Filter ``` # Searches for the word 'advocate' as long as it's within the category of 'Philosophy' results = elixir.search('advocate', text_filter={'category': 'Philosophy'}) ``` ## Negative Filter ``` # Searches for the word 'advocate' as long as it's NOT within the category of 'Philosophy' results = elixir.search('advocate', text_filter={'category': '!Philosophy'}) ``` # Calculate KWIC/Concordance Lines ``` kwic = results.kwic_lines(before=8, after=8, group_by='lower') ``` ## Export KWIC Lines to HTML This will generate a webpage with an interface to sort and filter KWIC lines. You can also switch the display of columns to show pos, lemma, and lower text. ``` results.export_as_html('my_kwic_lines.html') ``` ## Export KWIC Lines to TXT This will generate a text file with just the KWIC lines. There is a tab character before and after the search hit, making it easy to paste into Google Sheets. ``` results.export_as_txt('my_kwic_lines.txt') ``` # Calculate Sentences Get the full sentence for more context of your search query. ``` sentences = results.sentences() ```
About
No description, website, or topics provided.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published