NGrams doesnt support words with hyphen and slash in English #656

Closed
sam68740 opened this issue Oct 23, 2022 · 3 comments

Comments

@sam68740

A few English words contain a hyphen or a slash.

Example:

  • image-based
  • text-based
  • links/CTA

It would be great if Natural could manage these cases.

const natural = require("natural");

// Unigrams built with AggressiveTokenizer, which splits on "-", "/" and "’"
const text = "links text-based opposed image-based links/CTA’s";
const NGrams = natural.NGrams;
const tokenizer = new natural.AggressiveTokenizer();
NGrams.setTokenizer(tokenizer);
console.log(NGrams.ngrams(text, 1));

Output: [["links"], ["text"], ["based"], ["opposed"], ["image"], ["based"], ["links"], ["CTA"], ["s"]]

@Hugo-ter-Doest
Collaborator

The root cause of this behaviour is the tokenizer. Will look into adapting the tokenizer to support these characters.
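
To illustrate the point about the tokenizer, here is a minimal standalone sketch that does not use Natural at all. It assumes the splitting seen in the reported output comes from treating every non-word character as a separator, and shows one possible alternative: a regular expression that keeps a hyphen, slash, or apostrophe when it sits between two word parts. The names `tokenizeKeepingJoiners` and `ngrams` are hypothetical helpers written for this example. They are not part of Natural's API and not necessarily how the library was eventually changed.

```javascript
// Hypothetical helper (not part of Natural): treats "-", "/", "'" and "’"
// as part of a token when they appear between runs of letters or digits.
function tokenizeKeepingJoiners(text) {
  // One token = letters/digits, followed by zero or more
  // (joiner + letters/digits) groups.
  const pattern = /[A-Za-z0-9]+(?:[-\/'’][A-Za-z0-9]+)*/g;
  return text.match(pattern) || [];
}

// Hypothetical helper: builds n-grams from a token array with a sliding window.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

const text = "links text-based opposed image-based links/CTA’s";
const unigrams = ngrams(tokenizeKeepingJoiners(text), 1);
console.log(unigrams);
// [["links"], ["text-based"], ["opposed"], ["image-based"], ["links/CTA’s"]]
```

The pattern only covers ASCII letters and digits, so it is a sketch for the English examples in this issue rather than a general-purpose tokenizer.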

@sam68740
Author

sam68740 commented May 29, 2023 via email

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023
Solved issue #656
And discovered that Porter stemmer calls the tokenizer for English. Replaced that with the tokenizer for Dutch.
@Hugo-ter-Doest
Collaborator

Solved in #706

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023
Support for slash /