Making custom token for use in a custom plugin #167

Open
nrakic90 opened this issue Sep 29, 2016 · 6 comments

@nrakic90

Hello.

First I want to say good job on this plugin.
I am making a plugin that will detect a custom format, something along the lines of "keyword://test/test1/test2".
I managed to make a plugin based on what I saw in hashtag.js and mention.js. I am having trouble making a token out of "keyword". Can you explain this process a bit? I've attached a "sketch" of my plugin; would you kindly tell me what I am doing wrong? I would be grateful. All the best!
[Attached image: untitled]

@nfrasser
Collaborator

Hey @nrakic90, the first thing I wanted to mention is that the plugin API is largely undocumented and is subject to change in the future. Given that, kudos to you for figuring this out.

The big roadblock you'll run into next is due to a fundamental problem with the plugin API: There's no easy way to integrate new text tokens with the rest of the link-parsing state machine. I'm going to try my best to help you out here, but this is going to get complicated.

The first thing you need is to generate intermediate CharacterStates for the keyword text token. This will involve a call to the stateify function after you've defined KEYWORD_TOKEN. That should look like this:

let intermediateKeywordStates = stateify('keyword', S_START, KEYWORD_TOKEN, linkify.scanner.tokens.DOMAIN);

Then you'll need a loop like this for the intermediate states, since those could have jumps to domains (e.g., key is an intermediate state that could also be a domain, and keys is a domain that starts with key but will never resolve to keyword). ALPHANUM should be defined like this.

See how the localhost text token is handled for a real example of this.
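
To make the role of the intermediate states concrete, here is a minimal toy sketch in plain JavaScript. It is not linkify's scanner and does not call the real stateify; `State`, `buildWord`, `scanWord` and the `fallback` field are hypothetical names invented for this illustration. It only shows the idea that every partial match of keyword must still be able to fall through to a generic domain state.

```javascript
// Toy model only: NOT linkify's scanner and NOT the real stateify helper.
const DOMAIN = 'DOMAIN';
const KEYWORD = 'KEYWORD';

class State {
  constructor(token = null) {
    this.token = token;    // token emitted if scanning stops in this state
    this.next = new Map(); // exact character -> next state
    this.fallback = null;  // state used for any other alphanumeric character
  }
  step(ch) {
    if (this.next.has(ch)) return this.next.get(ch);
    return /[a-z0-9]/i.test(ch) ? this.fallback : null;
  }
}

// Generic domain state: any further alphanumeric character stays a domain.
const S_DOMAIN = new State(DOMAIN);
S_DOMAIN.fallback = S_DOMAIN;

const S_START = new State();
S_START.fallback = S_DOMAIN;

// Hypothetical stand-in for stateify: one state per character of `word`.
// Intermediate states accept as `defaultToken`, the last one as `endToken`.
function buildWord(word, start, endToken, defaultToken) {
  const intermediates = [];
  let state = start;
  for (let i = 0; i < word.length; i++) {
    const isLast = i === word.length - 1;
    const nextState = new State(isLast ? endToken : defaultToken);
    state.next.set(word[i], nextState);
    if (!isLast) intermediates.push(nextState);
    state = nextState;
  }
  return { intermediates, end: state };
}

const { intermediates, end: S_KEYWORD } =
  buildWord('keyword', S_START, KEYWORD, DOMAIN);

// The loop described above: each partial match (and the full match) can
// still jump to the generic domain state on any other alphanumeric.
for (const s of [...intermediates, S_KEYWORD]) s.fallback = S_DOMAIN;

function scanWord(word) {
  let state = S_START;
  for (const ch of word) {
    state = state.step(ch);
    if (!state) return null;
  }
  return state.token;
}
```

In this toy model, `scanWord('keyword')` yields `'KEYWORD'`, while `scanWord('key')`, `scanWord('keys')` and `scanWord('keywords')` all yield `'DOMAIN'`.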

In your example, seeing the text token keyword jumps you into the S_KEYWORD state from the S_START state. But what happens if instead of // you see .com? Then you'd expect keyword.com to be of type url. But text tokens currently are not polymorphic, so you'd have to manually define jumps to and from S_KEYWORD. Basically, you'll need to duplicate all lines in parser.js that contain S_DOMAIN and replace S_DOMAIN with S_KEYWORD.
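
The duplication described above can be pictured with another toy sketch. This is an assumption-heavy illustration, not parser.js: the state names other than S_START, S_DOMAIN and S_KEYWORD, the token names, and the `makeState` and `parse` helpers are all made up here. It covers only jumps out of S_DOMAIN; jumps into S_DOMAIN from other states would need the same treatment.

```javascript
// Toy parser states keyed by token type: NOT linkify's parser.js.
function makeState(accepts = null) {
  return { accepts, jumps: new Map() };
}

// Ordinary URL path: DOMAIN DOT TLD -> 'url'
const S_START = makeState();
const S_DOMAIN = makeState();
const S_DOMAIN_DOT = makeState();
const S_URL = makeState('url');
S_START.jumps.set('DOMAIN', S_DOMAIN);
S_DOMAIN.jumps.set('DOT', S_DOMAIN_DOT);
S_DOMAIN_DOT.jumps.set('TLD', S_URL);

// Custom path: KEYWORD COLON SLASH SLASH -> 'keyword-link' (hypothetical type)
const S_KEYWORD = makeState();
const S_KEYWORD_COLON = makeState();
const S_KEYWORD_SLASH = makeState();
const S_KEYWORD_LINK = makeState('keyword-link');
S_START.jumps.set('KEYWORD', S_KEYWORD);
S_KEYWORD.jumps.set('COLON', S_KEYWORD_COLON);
S_KEYWORD_COLON.jumps.set('SLASH', S_KEYWORD_SLASH);
S_KEYWORD_SLASH.jumps.set('SLASH', S_KEYWORD_LINK);

function parse(tokens) {
  let state = S_START;
  for (const t of tokens) {
    state = state.jumps.get(t);
    if (!state) return null;
  }
  return state.accepts;
}

// Before mirroring, "keyword.com" dead-ends: S_KEYWORD has no DOT jump.
const before = parse(['KEYWORD', 'DOT', 'TLD']);

// Mirror every jump out of S_DOMAIN onto S_KEYWORD, keeping custom jumps.
for (const [token, target] of S_DOMAIN.jumps) {
  if (!S_KEYWORD.jumps.has(token)) S_KEYWORD.jumps.set(token, target);
}

// After mirroring, the same token sequence resolves like a normal domain.
const after = parse(['KEYWORD', 'DOT', 'TLD']);
```

Here `before` is `null` and `after` is `'url'`, while `parse(['KEYWORD', 'COLON', 'SLASH', 'SLASH'])` still resolves to the custom type.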

TL;DR, this is doable but not pretty. There are definitely plans to improve this interface and abstract away all this complexity, but for now that's all the help I can offer.

@nrakic90
Author

nrakic90 commented Oct 3, 2016

Thank you so much for the in-depth explanation, I really appreciate it! I was experimenting with stateify at one point but then gave it up because I apparently didn't have all the pieces of the puzzle.
Thanks again!

@nfrasser nfrasser added this to the 3.0 milestone Mar 11, 2021
@nfrasser nfrasser mentioned this issue Mar 11, 2021
@nfrasser nfrasser modified the milestones: 3.0, 4.0 Oct 14, 2021
@nfrasser nfrasser self-assigned this Oct 14, 2021
@toger5

toger5 commented Jan 19, 2022

Has this gotten any easier? I would really like to use a custom token!

@toger5

toger5 commented Jan 19, 2022

From the docs, it seems like it should be possible to do S_START.tt("a", acceptedState)
to transition on an 'a'.
From the documentation:

* @param {string} input character or token to transition on

This does not seem to work. How is the word "character" meant in the docs?

@nfrasser
Collaborator

@toger5 I'm working on some additional examples/docs for this in an upcoming release. For now, check out the hashtag plugin for reference.

Notes:

  • Linkify has two state machines for tokenizing strings, the scanner and parser
  • The scanner groups string characters into smaller, self-contained tokens such as NUM (a number) or TLD (any top-level domain name like "com")
    • The starting state (S_START) is scanner.start
  • The parser (used in the hashtag plugin example) groups text tokens from the scanner into "multi-tokens" such as URL, EmailAddress or Hashtag
    • The starting state is parser.start
  • Similarly to how adding the hashtag multi-token works in the example plugin, you can add a new scanner token. For example:
    const GreetingState = scanner.start
      .tt('h')
      .tt('e')
      .tt('l')
      .tt('l')
      .tt('o', 'GREETING') // create accepting state
    The scanner will recognize the word "hello" as a GREETING token. You can capture the states and branch off to recognize additional GREETINGs:
    const HState = scanner.start.tt('h')
    const GreetingState = HState
      .tt('i', GreetingState)  // don't create a new accepting state, use the existing one
    Now both "hi" and "hello" are recognized as GREETING tokens. You can similarly use the GREETING token with the parser:
    const GreetingMultiToken = utils.createTokenClass('greeting', {
      isLink: true,
      toHref() {
        return `javascript:alert("${this.toString()}!")`
      }
    })
    parser.start.tt('GREETING', GreetingMultiToken)
  • There is no way to create tokens from arbitrary regular expressions right now with the tt method
    • You can, however, emulate anything that's possible with a regular expression by capturing the states and transitioning between them multiple times (the second argument to tt is either an accepting token or any previously-captured state).
    • This may improve in a future release.
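
As a rough illustration of the notes above, the following self-contained toy models a chainable `tt` whose second argument is either a token name or a previously captured state. It is a sketch under assumptions, not linkify's State class: `ToyState`, `run` and the LAUGH token are hypothetical, and the real method may behave differently between versions.

```javascript
// Toy model of a chainable tt(): NOT linkify's implementation.
class ToyState {
  constructor(token = null) {
    this.token = token;     // token accepted in this state, if any
    this.jumps = new Map(); // input character -> next state
  }
  // tt(input)          -> create (or reuse) a non-accepting next state
  // tt(input, 'TOKEN') -> next state accepts as TOKEN
  // tt(input, state)   -> jump to a previously captured state
  tt(input, tokenOrState) {
    if (tokenOrState instanceof ToyState) {
      this.jumps.set(input, tokenOrState);
      return tokenOrState;
    }
    let next = this.jumps.get(input);
    if (!next) {
      next = new ToyState();
      this.jumps.set(input, next);
    }
    if (tokenOrState) next.token = tokenOrState;
    return next;
  }
}

function run(start, text) {
  let state = start;
  for (const ch of text) {
    state = state.jumps.get(ch);
    if (!state) return null;
  }
  return state.token;
}

const start = new ToyState();

// Capture the shared prefix state once, then branch off it.
const HState = start.tt('h');
const GreetingState = HState.tt('e').tt('l').tt('l').tt('o', 'GREETING');
HState.tt('i', GreetingState); // reuse the existing accepting state

// Emulating the regular expression /^ha(ha)*$/ by looping back to a
// captured state instead of creating new ones.
const HaState = HState.tt('a', 'LAUGH');
HaState.tt('h').tt('a', HaState);
```

With this toy, `run(start, 'hello')` and `run(start, 'hi')` both give `'GREETING'`, `run(start, 'hahaha')` gives `'LAUGH'`, and partial inputs such as `'hell'` or `'hah'` give `null`.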

@toger5

toger5 commented Jan 21, 2022

This is super helpful, thank you very much for the detailed comment!
I was trying something like this:

const acceptingState = createTokenClass("something")
scanner.start
  .tt('h')
  .tt('e')
  .tt('l')
  .tt('l')
  .tt('o', acceptingState)

but that did not seem to work.
To me, PARAM1 and PARAM2 in const PARAM1 = state.tt('TOKEN') and state.tt('TOKEN', PARAM2) seemed basically the same, except that in the second case PARAM2 needs to be created beforehand.
In your example they seem to differ, in that PARAM2 can also be used to add a new token called GREETING.
This seems to indicate that there is another difference between PARAM1 and PARAM2:

// (A)
const GreetingState = HState
  .tt('i', GreetingState)  // don't create a new accepting state, use the existing one
// VS
// (B)
const GreetingState = HState
  .tt('i')

What I tried is (A), but that does not seem to work. (B), however, does. What exactly is the difference between those two?
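
One possible reading of the difference, independent of linkify: in plain JavaScript, a `const` cannot be read inside its own initializer, so a line shaped like (A) throws before `tt` is ever called. The sketch below demonstrates only that language behavior; the `tt` function here is a hypothetical stand-in, not the library method.

```javascript
// Hypothetical stand-in for HState.tt(input, existingState): returns the
// state it is given, or a fresh object when called without one.
function tt(input, existingState) {
  return existingState ?? { input, token: null };
}

// Shape of (A): the right-hand side runs before GreetingState is
// initialized, so reading it throws (temporal dead zone).
function variantA() {
  try {
    const GreetingState = tt('i', GreetingState);
    return GreetingState;
  } catch (err) {
    return err.name;
  }
}

// Shape of (B): nothing is read before it exists, so a new state is made.
function variantB() {
  const GreetingState = tt('i');
  return GreetingState;
}

// Reusing a captured state: declare it once, then pass it along without
// a second `const` of the same name.
function variantReuse() {
  const GreetingState = tt('o');
  const sameState = tt('i', GreetingState);
  return sameState === GreetingState;
}
```

Here `variantA()` returns `'ReferenceError'`, `variantB()` returns a new object, and `variantReuse()` returns `true`. If an earlier `const GreetingState` already exists in the same scope, a second `const GreetingState = ...` is instead rejected as a redeclaration (a SyntaxError), so in either case the second branch is better written as a bare `HState.tt('i', GreetingState)`.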
