NGrams doesn't support words with hyphen and slash in English #656

sam68740 · 2022-10-23T13:01:01Z

A few English words contain a hyphen or a slash.

Examples:

image-based
text-based
links/CTA

It would be great if Natural could manage these cases.

const natural = require('natural');

const text = "links text-based opposed image-based links/CTA’s";
const NGrams = natural.NGrams;
// Use the aggressive tokenizer for n-gram extraction.
const tokenizer = new natural.AggressiveTokenizer();
NGrams.setTokenizer(tokenizer);
console.log(NGrams.ngrams(text, 1)); // unigrams

Output: [["links"], ["text"], ["based"], ["opposed"], ["image"], ["based"], ["links"], ["CTA"], ["s"]]


Hugo-ter-Doest · 2023-05-29T18:11:00Z

The root cause of this behaviour is the tokenizer. Will look into adapting the tokenizer to support these characters.
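To illustrate the point about the tokenizer, here is a minimal sketch of a tokenizer that keeps word-internal hyphens and slashes. The class `HyphenSlashTokenizer` and the helper `ngrams` are hypothetical names made up for this example; they are not part of Natural and are not the fix that was later merged. The helper only stands in for `NGrams.ngrams` so the snippet runs without dependencies.

```javascript
// Hypothetical tokenizer (not part of Natural): keeps runs of letters and
// digits, and allows a single hyphen or slash between two such runs.
class HyphenSlashTokenizer {
  tokenize(text) {
    // \p{L} matches any Unicode letter, \p{N} any Unicode number.
    const pattern = /[\p{L}\p{N}]+(?:[-\/][\p{L}\p{N}]+)*/gu;
    return text.match(pattern) || [];
  }
}

// Minimal n-gram helper, standing in for natural.NGrams.ngrams.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

const sketchTokenizer = new HyphenSlashTokenizer();
const sample = "links text-based opposed image-based links/CTA’s";
console.log(ngrams(sketchTokenizer.tokenize(sample), 1));
```

With this pattern, "text-based", "image-based" and "links/CTA" each stay a single token; the apostrophe still splits off the trailing "s". Assuming `NGrams.setTokenizer` only needs an object with a `tokenize(text)` method, an instance of such a class could probably be passed to it in the same way as the `AggressiveTokenizer` in the original report.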

sam68740 · 2023-05-29T19:14:25Z

There are also issues in other languages. For instance, the Eszett symbol ß is not supported either. It would be nice if you could check that as well. Thanks


Solved issue #656. Also discovered that the Porter stemmer calls the tokenizer for English; replaced that with the tokenizer for Dutch.

Hugo-ter-Doest · 2023-11-26T20:27:18Z

Solved in #706

Support for slash /

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023

Solved issue #656

6a7bb92

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023

Solved issue #656 (#706)

4bd7ca4


Hugo-ter-Doest mentioned this issue Nov 26, 2023

Check for Eszett symbol ß in German tokenizer #707

Closed

Hugo-ter-Doest closed this as completed Nov 26, 2023

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023

Issue #656 part 2

0656c85

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023

Issue #656 part 2 (#708)

5d1f491
