[Question] Sentence segmentation in gensim #1135

Closed
narenchoudhary opened this issue Feb 4, 2017 · 5 comments

Comments

@narenchoudhary

I need to split a corpus text stream into sentences for further processing. I checked the gensim documentation but could not find anything on sentence segmentation.

Is there any utility available in gensim for sentence segmentation?

@narenchoudhary narenchoudhary changed the title Sentence segmentation in gensim [Question] Sentence segmentation in gensim Feb 4, 2017
@gojomo
Collaborator

gojomo commented Feb 4, 2017

No, gensim generally expects such tokenization to happen elsewhere. Popular Python options for this include NLTK and spaCy.
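As a rough illustration of what "happening elsewhere" can look like, here is a minimal standard-library sketch that splits text into sentences and then into token lists, the shape that gensim models such as `Word2Vec` accept as input. The names `naive_split_sentences` and `to_token_lists` are hypothetical helpers, not part of gensim, NLTK, or spaCy, and the regex is a deliberately naive assumption: it breaks on abbreviations such as "Dr." and leaves punctuation attached to tokens. A dedicated tool such as NLTK's `sent_tokenize` is likely to be more robust.

```python
import re

# Naive boundary (assumption): '.', '!' or '?' followed by whitespace.
# This will wrongly split after abbreviations such as "Dr." or "e.g.".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def naive_split_sentences(text):
    """Split text into sentences on sentence-final punctuation plus whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def to_token_lists(text):
    """Lowercase and whitespace-tokenize each sentence (punctuation stays attached)."""
    return [sentence.lower().split() for sentence in naive_split_sentences(text)]


# Hypothetical input; the result is a list of token lists, one per sentence.
sentences = to_token_lists("Gensim trains on sentences. Each one is a list of tokens!")
```

The resulting `sentences` value could then be passed to a model that takes an iterable of token lists.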

@narenchoudhary
Author

It's a very basic, must-have feature in an NLP library. gensim provides functions for parsing text from a corpus. It also has models that take a list of sentences as an argument. But there is no utility for segmenting text into sentences, which is very disappointing. 😞

@piskvorky
Owner

I agree. Unsupervised segmentation (into blocks/sentences/words) falls nicely under gensim's mission. Pull requests welcome :)

CC @tmylk.

@tmylk
Contributor

tmylk commented Feb 6, 2017

Added to wiki

@tmylk tmylk closed this as completed Feb 6, 2017
@rpedela

rpedela commented Feb 9, 2017

The summarization code splits text into sentences. Based on the English summary results, it does a great job.

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/summarization/textcleaner.py
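For the streaming case raised in the original question, a lazy generator may be a better fit than splitting a whole document at once. The sketch below is not the code from `textcleaner.py`; it is a standard-library illustration under the assumption that a sentence ends at `.`, `!` or `?` followed by whitespace. `iter_sentences` is a hypothetical name, and the rule will mis-split abbreviations.

```python
import re

# Assumed boundary rule: sentence-final punctuation followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks):
    """Yield sentences lazily from an iterable of text chunks (e.g. file lines)."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # The last part may be an unfinished sentence; keep it for the next chunk.
        buffer = parts.pop()
        for part in parts:
            sentence = " ".join(part.split())  # collapse newlines and extra spaces
            if sentence:
                yield sentence
    # Flush whatever remains once the stream is exhausted.
    tail = " ".join(buffer.split())
    if tail:
        yield tail
```

Because a file object iterates over its lines, something like `iter_sentences(open(path))` would process a large corpus without loading it fully into memory; `path` here stands for whatever file is being read.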
