short_text_tagger generates topic distributions for all texts in a corpus.
- Free software: MIT license
pip install short-text-tagger
This package depends on graph-tool, which is a C++ library with a Python wrapper. See https://git.skewed.de/count0/graph-tool/-/wikis/installation-instructions for instructions on how to install graph-tool.
If you have graph-tool installed and want to use its community detection functionality to generate topics, import generate_topic_distributions_from_corpus, which expects a pandas DataFrame with columns id and text:
# example
import pandas as pd
from short_text_tagger.short_text_tagger import generate_topic_distributions_from_corpus
sample_df = pd.DataFrame({
    'id': [1, 2, 3, ...],
    'text': [
        'The store was crazy today. ',
        'I went to the store to get apples, oranges, and pears. But the lines were long. Waited 45 minutes to checkout.',
        'The lines were so short, so I was out of there quickly. I bought apples, pears, and beer.',
        ...
    ]
})
topics_df = generate_topic_distributions_from_corpus(sample_df)
The parameter block_level influences how many topics are finally assigned to the corpus. If the corpus is small, a smaller block_level may be necessary because there are few observations. If the corpus is very large, the nested stochastic block model (NSBM) will have much greater depth, so you may need to increase block_level to avoid an unwieldy number of topics. block_level is set to 2 by default.
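To make the effect of block_level more concrete, the toy sketch below models a nested hierarchy as a list of mappings: level 0 assigns words to fine-grained blocks, and each higher level merges the blocks of the level below it. The hierarchy, the block labels, and the helpers block_at_level and topic_count are all made up for illustration and are not part of short_text_tagger or graph-tool. The sketch assumes, as the paragraph above suggests, that a higher level yields fewer, coarser topics.

```python
from typing import Dict, List

# Hypothetical nested hierarchy (illustration only, not produced by the package).
# hierarchy[0] maps each word to a level-0 block; hierarchy[k] for k >= 1 maps
# each block of level k-1 to a block of level k.
hierarchy: List[Dict] = [
    {"apples": 0, "pears": 0, "oranges": 1, "beer": 2,
     "lines": 3, "checkout": 3, "store": 4},
    {0: 0, 1: 0, 2: 1, 3: 2, 4: 2},
    {0: 0, 1: 0, 2: 1},
]


def block_at_level(word: str, level: int) -> int:
    """Follow the hierarchy upward to find the word's block at the given level."""
    block = hierarchy[0][word]
    for mapping in hierarchy[1:level + 1]:
        block = mapping[block]
    return block


def topic_count(level: int) -> int:
    """Number of distinct blocks (topics) the vocabulary falls into at a level."""
    return len({block_at_level(word, level) for word in hierarchy[0]})


# Print how many topics remain at each level of the toy hierarchy.
for level in range(len(hierarchy)):
    print(level, topic_count(level))
```

In this toy hierarchy the number of topics shrinks as the level rises, which is the trade-off to keep in mind when choosing block_level for a small or a very large corpus.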
If you don't have graph-tool installed, or want to provide your own word-to-topic maps, you can import the functions that perform text preprocessing and text topic probability generation:
# example 2
import pandas as pd
from short_text_tagger.short_text_tagger import cleaned_texts_df_from_data
from short_text_tagger.short_text_tagger import assign_text_probabilities
sample_df = pd.DataFrame({
    'id': [1, 2, 3, ...],
    'text': [
        'The store was crazy today. ',
        'I went to the store to get apples, oranges, and pears. But the lines were long. Waited 45 minutes to checkout.',
        'The lines were so short, so I was out of there quickly. I bought apples, pears, and beer.',
        ...
    ]
})
preprocessed_df = cleaned_texts_df_from_data(sample_df) # adds a "words" column (List[str])
# Create your own List[Dict[str,str]], where each element in the list is a dict of word to topic mappings.
# In this package, the function "word_to_block_dict" accomplishes this.
word_to_topic_dict_list = ...
final_df = assign_text_probabilities(preprocessed_df, word_to_topic_dict_list)
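As a rough illustration of this second workflow, the sketch below builds a small word-to-topic map by hand and turns a tokenized text into a topic distribution by counting the share of its mapped words that fall into each topic. The topic labels, the helper topic_distribution, and the counting rule are assumptions made for this example; assign_text_probabilities may compute its probabilities differently.

```python
from collections import Counter
from typing import Dict, List

# Hand-made word-to-topic map (hypothetical labels, illustration only).
word_to_topic: Dict[str, str] = {
    "apples": "groceries",
    "pears": "groceries",
    "oranges": "groceries",
    "beer": "groceries",
    "lines": "waiting",
    "checkout": "waiting",
}

# Same shape as the List[Dict[str, str]] described in the example above:
# a list whose elements are word-to-topic dicts.
word_to_topic_dict_list: List[Dict[str, str]] = [word_to_topic]


def topic_distribution(words: List[str], mapping: Dict[str, str]) -> Dict[str, float]:
    """Share of a text's mapped words per topic; unmapped words are ignored."""
    counts = Counter(mapping[w] for w in words if w in mapping)
    total = sum(counts.values())
    if total == 0:
        return {}  # no word of this text appears in the map
    return {topic: n / total for topic, n in counts.items()}


# A tokenized text, similar to one row of the "words" column.
words = ["lines", "short", "bought", "apples", "pears", "beer"]
print(topic_distribution(words, word_to_topic_dict_list[0]))
```

Ignoring unmapped words keeps each distribution summing to 1, but it also means a text with no known words gets an empty result, which a real pipeline would need to handle explicitly.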
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.