short_text_tagger

short_text_tagger generates topic distributions for all texts in a corpus.

Free software: MIT license

Installation

pip install short-text-tagger

Usage

This package depends on graph-tool, which is a C++ library with a Python wrapper. See https://git.skewed.de/count0/graph-tool/-/wikis/installation-instructions for instructions on how to install graph-tool.
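Since the two workflows below differ on whether graph-tool is available, it can help to check for it up front. The following is a minimal sketch, not part of this package: the helper name has_graph_tool is hypothetical, and it relies only on the fact that graph-tool is imported in Python as graph_tool.

```python
import importlib.util


def has_graph_tool() -> bool:
    """Return True if the graph_tool module can be located (it is not imported)."""
    return importlib.util.find_spec("graph_tool") is not None


if has_graph_tool():
    print("graph-tool found: topics can be generated with community detection")
else:
    print("graph-tool not found: supply your own word-to-topic maps instead")
```

The check only locates the module, so it stays fast even though importing graph-tool itself can be slow.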

If graph-tool is installed and you want to use its community detection to generate topics, import generate_topic_distributions_from_corpus, which expects a pandas DataFrame with the columns id and text:

# example 1

import pandas as pd
from short_text_tagger.short_text_tagger import generate_topic_distributions_from_corpus

sample_df = pd.DataFrame({
    'id': [1, 2, 3],  # ... one id per text
    'text': [
        'The store was crazy today. ',
        'I went to the store to get apples, oranges, and pears. But the lines were long. Waited 45 minutes to checkout.',
        'The lines were so short, so I was out of there quickly. I bought apples, pears, and beer.',
        # ... additional texts
    ]
})

topics_df = generate_topic_distributions_from_corpus(sample_df)

The block_level parameter influences how many topics the corpus ends up with. For a small corpus, a smaller block_level may be necessary because there are few observations. For a very large corpus, the nested stochastic block model (NSBM) has a much deeper hierarchy, so you may need to increase block_level to avoid an unwieldy number of topics. block_level defaults to 2.
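To make the effect of block_level concrete, here is a toy sketch of a nested hierarchy in plain Python. It does not call graph-tool or this package: toy_hierarchy, word_blocks_at_level, and topic_count are hypothetical names, the block assignments are made up, and the package's own level indexing may differ. It only illustrates why a higher level yields fewer, coarser topics.

```python
# Toy stand-in for a nested block hierarchy (NOT graph-tool output).
# Level 0 maps each word to a fine-grained block; every later level maps
# the blocks of the previous level to coarser parent blocks.
toy_hierarchy = [
    {"apples": 0, "pears": 1, "oranges": 2, "lines": 3, "checkout": 4, "beer": 5},
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2},
    {0: 0, 1: 0, 2: 1},
]


def word_blocks_at_level(hierarchy, level):
    """Project every word onto the block it belongs to at the given level."""
    word_to_block = dict(hierarchy[0])
    for parent_of in hierarchy[1:level + 1]:
        word_to_block = {w: parent_of[b] for w, b in word_to_block.items()}
    return word_to_block


def topic_count(hierarchy, level):
    """Number of distinct blocks (topics) that words fall into at the given level."""
    return len(set(word_blocks_at_level(hierarchy, level).values()))


for level in range(len(toy_hierarchy)):
    print(f"level {level}: {topic_count(toy_hierarchy, level)} topics")
```

In this toy hierarchy the topic count shrinks from 6 to 3 to 2 as the level rises, which mirrors the advice above: go lower when a small corpus leaves too few topics, and higher when a large corpus produces too many.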

If you don't have graph-tool installed, or you want to provide your own word-to-topic maps, you can import the functions that handle text preprocessing and text topic probability generation:

# example 2

import pandas as pd
from short_text_tagger.short_text_tagger import cleaned_texts_df_from_data
from short_text_tagger.short_text_tagger import assign_text_probabilities

sample_df = pd.DataFrame({
    'id': [1, 2, 3],  # ... one id per text
    'text': [
        'The store was crazy today. ',
        'I went to the store to get apples, oranges, and pears. But the lines were long. Waited 45 minutes to checkout.',
        'The lines were so short, so I was out of there quickly. I bought apples, pears, and beer.',
        # ... additional texts
    ]
})

preprocessed_df = cleaned_texts_df_from_data(sample_df)  # adds a "words" column (List[str])

# Create your own List[Dict[str, str]], where each element of the list is a dict mapping words to topics.
# In this package, the function "word_to_block_dict" accomplishes this.
word_to_topic_dict_list = ...

final_df = assign_text_probabilities(preprocessed_df, word_to_topic_dict_list)
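For a sense of what a hand-built word_to_topic_dict_list could look like, and how such a map might turn a list of words into topic shares, here is a self-contained sketch. It is an assumption for illustration only: topic_shares is a hypothetical helper, the topic labels are invented, and the simple word-fraction rule is not necessarily how assign_text_probabilities computes its probabilities.

```python
from collections import Counter
from typing import Dict, List


def topic_shares(words: List[str], word_to_topic: Dict[str, str]) -> Dict[str, float]:
    """Fraction of mapped words that fall in each topic; unmapped words are ignored."""
    counts = Counter(word_to_topic[w] for w in words if w in word_to_topic)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()} if total else {}


# Hypothetical hand-built maps in the List[Dict[str, str]] shape.
word_to_topic_dict_list = [
    {"apples": "groceries", "pears": "groceries", "lines": "waiting", "checkout": "waiting"},
]

# A word list like one entry of the "words" column.
words = ["store", "apples", "pears", "lines", "checkout", "apples"]

for word_to_topic in word_to_topic_dict_list:
    print(topic_shares(words, word_to_topic))
```

Here "store" has no topic and is skipped, so three of the five mapped words count toward "groceries" and two toward "waiting".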

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
