how can I use the original text in the snippets after cleaning? #54

Open
mikkokotila opened this issue Mar 19, 2020 · 10 comments

Comments

@mikkokotila

Once I've removed stopwords using nltk or similar, I want to be able to see the original text snippets and not the ones without stopwords. How can I achieve that?

@JasonKessler
Owner

JasonKessler commented Mar 27, 2020

The preferred way to remove stopwords in Scattertext is to pass the full documents into a Corpus factory and then use the Corpus.remove_terms method to create a corpus free of stopwords. You'll still be able to view the original documents in the Scattertext explorer.

For example:

import scattertext as st

# Build a corpus from the full, unmodified documents.
convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
	parse = lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parse').build()

# Remove stopwords from the corpus; the original documents stay attached.
stoplisted_corpus = corpus.remove_terms(['a', 'the'])
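
If you want to drop a full NLTK stopword list (as in the original question) rather than a couple of hand-picked terms, a sketch along these lines should work; the get_terms() intersection is my own hedge against passing terms that never occur in the corpus:

from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))
# Only remove stopwords that actually appear as terms in the corpus.
stoplisted_corpus = corpus.remove_terms(
    [term for term in corpus.get_terms() if term in english_stopwords]
)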

On the other hand, you can pass an alternative_text_field parameter to produce_scattertext_explorer or another compatible function. This should be the name of a column in the data frame used to create the corpus; that column is what gets searched and displayed in the Scattertext visualization. However, the alternative text field is not used to make the plot itself.
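
A minimal sketch of that second approach, assuming the raw 'text' column from the sample data above is what you want searched and displayed:

html = st.produce_scattertext_explorer(
    stoplisted_corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    alternative_text_field='text'  # original, uncleaned text column from convention_df
)
open('convention_stoplisted.html', 'w').write(html)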

@sound118

I am actually facing the same issue as mentioned above. I am making a Scattertext plot for Chinese and I followed your instructions above by passing an alternative_text_field parameter into produce_scattertext_explorer. When I click a term in the plot, no original text shows up in the snippets; actually, nothing shows up there at all. How do I make the original text appear?

@JasonKessler
Owner

Could you upload the example which fails to show snippets?

@sound118

@JasonKessler I just uploaded the example that can reproduce the issue, please see https://github.com/sound118/Scatter-text-for-Chinese

I used "jieba" package to remove stopwords list and load user-defined dictionary in case of any wrong Chinese term segmentation, applied your "chinese_nlp" afterwards.
df['parsed_text'] = df['parsed_text'].apply(chinese_nlp)

You can change the file path and run the program on your local machine to reproduce the issue.

Thanks.

@JasonKessler
Owner

I think the issue is that the alternative text field has to be whitespace-tokenized for the matcher to work.
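
A minimal sketch of that idea, assuming jieba-based segmentation and illustrative column names (not taken from the linked repository):

import jieba

# Whitespace-join the segmented tokens so the explorer's term matcher can
# locate clicked terms inside the alternative text field.
df['alt_text'] = df['original_text'].apply(lambda s: ' '.join(jieba.cut(s)))

You would then pass alternative_text_field='alt_text' to produce_scattertext_explorer.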

@sound118

@JasonKessler, thanks for the hint. It works after adding
df['text'] = df['text'].apply(chinese_nlp)
to the uploaded program. At least the alternative text field is still readable after whitespace tokenization, as opposed to the parsed documents in Chinese. It would be even better if the original, unsegmented Chinese documents could be shown in the snippets, should such a feature be added to the scattertext package. Nevertheless, it's elegant enough.

@JasonKessler
Owner

Glad to hear it works.

It would be a good feature for someone in the community to pick up and build.

@MastafaF

Hi @JasonKessler ,

I have the same issue here and I could not solve it with your suggestion.
Basically, the following code is used:

data = data.loc[:, ['id', 'language', 'ProcessedText', 'OriginalText']]

import scattertext as st
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# Parse the cleaned text and build a stoplisted unigram corpus.
data['parse'] = data['ProcessedText'].apply(st.whitespace_nlp_with_sentences)

unigram_corpus = (st.CorpusFromParsedDocuments(data,
                                               category_col='language',
                                               parsed_col='parse')
                  .build().get_stoplisted_unigram_corpus())

html = st.produce_scattertext_explorer(
            unigram_corpus,
            category='French', category_name='French', not_category_name='German',
            minimum_term_frequency=0, pmi_threshold_coefficient=0,
            width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
            alternative_text_field='OriginalText',
            transform=st.Scalers.dense_rank
)

What I expect is to see the full text from the OriginalText column after clicking on a given word in the chart. However, at the moment I only see a chunk of that text.

For example, when clicking on the word 'thank', I would see something like the following:

Thank you! 

When I expect to see the following instead:

This was a great moment. Thank you!

Basically, I do not want the chunking when searching for a given word in my text column. Can we achieve that? 😄

@JasonKessler
Owner

Try adding use_full_doc=True as an argument to produce_scattertext_explorer. If that doesn't work, could you please post an independently runnable example which demonstrates the problem?
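
For reference, applied to the call posted above that would look roughly like this (only the use_full_doc argument is added):

html = st.produce_scattertext_explorer(
    unigram_corpus,
    category='French', category_name='French', not_category_name='German',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
    alternative_text_field='OriginalText',
    use_full_doc=True,  # display whole documents rather than sentence-level snippets
    transform=st.Scalers.dense_rank
)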

@MastafaF

Works great @JasonKessler! Thanks 😄
