Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: Regex-validated string type #6579

Closed
tenry92 opened this issue Jan 22, 2016 · 146 comments
Closed

Suggestion: Regex-validated string type #6579

tenry92 opened this issue Jan 22, 2016 · 146 comments

Comments

@tenry92
Copy link

@tenry92 tenry92 commented Jan 22, 2016

There are cases, where a property can not just be any string (or a set of strings), but needs to match a pattern.

let fontStyle: 'normal' | 'italic' = 'normal'; // already available in master
let fontColor: /^#([0-9a-f]{3}|[0-9a-f]{6})$/i = '#000'; // my suggestion

It's common practice in JavaScript to store color values in css notation, such as in the css style reflection of DOM nodes or various 3rd party libraries.

What do you think?

@DanielRosenwasser
Copy link
Member

@DanielRosenwasser DanielRosenwasser commented Jan 24, 2016

Yeah, I've seen this combing through DefinitelyTyped, . Even we could use something like this with ScriptElementKind in the services layer, where we'd ideally be able to describe these as a comma-separated list of specific strings.

The main problems are:

  • It's not clear how to compose these well. If I want a comma-separated list of "cat", "dog", and "fish", then I need to write something like /dog|cat|fish(,(dog|cat|fish))*/.
    • If I already have types describing string literal types for "cat", "dog", and "fish", how do I integrate them into this regex?
    • Clearly there's repetition here, which is undesirable. Perhaps fixing the previous issue would make this easier.
  • Non-standard extensions make this sort of iffy.
@zackarychapple
Copy link

@zackarychapple zackarychapple commented Apr 13, 2016

Huge +1 on this, ZipCode, SSN, ONet, many other use cases for this.

@radziksh
Copy link

@radziksh radziksh commented May 11, 2016

I faced the same problem, and I see that it is not implemented yet, maybe this workaround will be helpful:
http://stackoverflow.com/questions/37144672/guid-uuid-type-in-typescript

@rylphs
Copy link

@rylphs rylphs commented May 18, 2016

As @mhegazy suggested I will put my sugggestion (#8665) here. What about allow simple validation functions in type declarations? Something like that:

type Integer(n:number) => String(n).macth(/^[0-9]+$/)
let x:Integer = 3 //OK
let y:Integer = 3.6 //wrong

type ColorLevel(n:number) => n>0 && n<= 255
type RGB = {red:ColorLevel, green:ColorLevel, blue:ColorLevel};
let redColor:RGB = {red:255, green:0, blue:0}   //OK
let wrongColor:RGB = {red:255, green:900, blue:0} //wrong

type Hex(n:string) => n.match(/^([0-9]|[A-F])+$/)
let hexValue:Hex = "F6A5" //OK
let wrongHexValue:Hex = "F6AZ5" //wrong

The value that the type can accept would be determined by the function parameter type and by the function evaluation itself. That would solve #7982 also.

@AlexLeung
Copy link

@AlexLeung AlexLeung commented Jul 23, 2016

@rylphs +1 this would make TypeScript extremely powerful

@maiermic
Copy link

@maiermic maiermic commented Aug 30, 2016

How does subtyping work with regex-validated string types?

let a: RegExType_1
let b: RegExType_2

a = b // Is this allowed? Is RegExType_2 subtype of RegExType_1?
b = a // Is this allowed? Is RegExType_1 subtype of RegExType_2?

where RegExType_1 and RegExType_2 are regex-validated string types.

Edit: It looks like this problem is solvable in polynomial time (see The Inclusion Problem for Regular Expressions).

@basarat
Copy link
Contributor

@basarat basarat commented Oct 17, 2016

Would also help with TypeStyle : typestyle/typestyle#5 🌹

@DanielRosenwasser
Copy link
Member

@DanielRosenwasser DanielRosenwasser commented Oct 17, 2016

In JSX, @RyanCavanaugh and I've seen people add aria- (and potentially data-) attributes. Someone actually added a string index signature in DefinitelyTyped as a catch-all. A new index signature for this would have be helpful.

interface IntrinsicElements {
    // ....
    [attributeName: /aria-\w+/]: number | string | boolean;
}
@Igmat
Copy link

@Igmat Igmat commented Nov 18, 2016

Design Proposal

There are a lot of cases when developers need more specified value then just a string, but can't enumerate them as union of simple string literals e.g. css colors, emails, phone numbers, ZipCode, swagger extensions etc. Even json schema specification which commonly used for describing schema of JSON object has pattern and patternProperties that in terms of TS type system could be called regex-validated string type and regex-validated string type of index.

Goals

Provide developers with type system that is one step closer to JSON Schema, that commonly used by them and also prevent them from forgetting about string validation checks when needed.

Syntactic overview

Implementation of this feature consists of 4 parts:

Regex validated type

type CssColor = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;
type Email = /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i;
type Gmail = /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@gmail\.com$/i;

Regex-validated variable type

let fontColor: /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;

and the same, but more readable

let fontColor: CssColor;

Regex-validated variable type of index

interface UsersCollection {
    [email: /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i]: User;
}

and the same, but more readable

interface UsersCollection {
    [email: Email]: User;
}

Type guard for variable type

setFontColorFromString(color: string) {
    fontColor = color;// compile time error
    if (/^#([0-9a-f]{3}|[0-9a-f]{6})$/i.test(color)) {
        fontColor = color;// correct
    }
}

and same

setFontColorFromString(color: string) {
    fontColor = color;// compile time error
    if (!(/^#([0-9a-f]{3}|[0-9a-f]{6})$/i.test(color))) return;
    fontColor = color;// correct
}

and using defined type for better readability

setFontColorFromString(color: string) {
    fontColor = color;// compile time error
    if (CssColor.test(color)) {
        fontColor = color;// correct
    }
}

same as

setFontColorFromString(color: string) {
    fontColor = color;// compile time error
    if (!(CssColor.test(color))) return;
    fontColor = color;// correct
}

Type gurard for index type

let collection: UsersCollection;
getUserByEmail(email: string) {
    collection[email];// type is any
    if (/^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i.test(email)) {
        collection[email];// type is User
    }
}

same as

let collection: UsersCollection;
getUserByEmail(email: string) {
    collection[email];// type is any
    if (!(/^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i.test(email))) return;
    collection[email];// type is User
}

and using defined type for better readability

let collection: UsersCollection;
getUserByEmail(email: string) {
    collection[email];// type is any
    if (Email.test(email)) {
        collection[email];// type is User
    }
}

same as

let collection: UsersCollection;
getUserByEmail(email: string) {
    collection[email];// type is any
    if (!(Email.test(email))) return;
    collection[email];// type is User
}

Semantic overview

Assignments

let email: Email;
let gmail: Gmail;
email = 'test@example.com';// correct
email = 'test@gmail.com';// correct
gmail = 'test@example.com';// compile time error
gmail = 'test@gmail.com';// correct
gmail = email;// obviously compile time error
email = gmail;// unfortunately compile time error too

Unfortunately we can't check is one regex is subtype of another without hard performance impact due to this article. So it should be restricted. But there are next workarounds:

// explicit cast
gmail = <Gmail>email;// correct
// type guard
if (Gmail.test(email)) {
    gmail = email;// correct
}
// another regex subtype declaration
type Gmail = Email & /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@gmail\.com$/i;
gmail = email;// correct

Unfortunately assigning of string variable to regex-validated variable should also be restricted, because there is no guaranty in compile time that it will match regex.

let someEmail = 'test@example.com';
let someGmail = 'test@gmail.com';
email = someEmail;// compile time error
gmail = someGmail;// compile time error

But we are able to use explicit cast or type guards as shown here. Second is recommended.
Luckily it's not a case for string literals, because while using them we ARE able to check that its value matches regex.

let someEmail: 'test@example.com' = 'test@example.com';
let someGmail: 'test@gmail.com' = 'test@gmail.com';
email = someEmail;// correct
gmail = someGmail;// correct

Type narrowing for indexes

For simple cases of regex-validated type of index see Type gurard for index type.
But there could be more complicated cases:

type Email = /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i;
type Gmail = /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@gmail\.com$/i;
interface UsersCollection {
    [email: Email]: User;
    [gmail: Gmail]: GmailUser;
}
let collection: UsersCollection;
let someEmail = 'test@example.com';
let someGmail = 'test@gmail.com';
collection['test@example.com'];// type is User
collection['test@gmail.com'];// type is User & GmailUser
collection[someEmail];// unfortunately type is any
collection[someGmail];// unfortunately type is any
// explicit cast is still an unsafe workaround
collection[<Email> someEmail];// type is User
collection[<Gmail> someGmail];// type is GmailUser
collection[<Email & Gmail> someGmail];// type is User & GmailUser

Literals haven't such problem:

let collection: UsersCollection;
let someEmail: 'test@example.com' = 'test@example.com';
let someGmail: 'test@gmail.com' = 'test@gmail.com';
collection[someEmail];// type is User
collection[someGmail];// type is User & GmailUser

But for variables the best option is using type guards as in next more realistic examples:

getUserByEmail(email: string) {
    collection[email];// type is any
    if (Email.test(email)) {
        collection[email];// type is User
        if (Gmail.test(email)) {
            collection[email];// type is User & GmailUser
        }
    }
    if (Gmail.test(email)) {
        collection[email];// type is GmailUser
    }
}

But if we'll use better definition for Gmail type it would have another type narrowing:

type Gmail = Email & /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@gmail\.com$/i;
getUserByEmail(email: string) {
    collection[email];// type is any
    if (Email.test(email)) {
        collection[email];// type is User
        if (Gmail.test(email)) {
            collection[email];// type is User & GmailUser
        }
    }
    if (Gmail.test(email)) {
        collection[email];// type is User & GmailUser
    }
}

Unions and intersections

Actually common types and regex-validated types are really different, so we need rules how correclty handle their unions and intersections.

type Regex_1 = / ... /;
type Regex_2 = / ... /;
type NonRegex = { ... };
type test_1 = Regex_1 | Regex_2;// correct
type test_2 = Regex_1 & Regex_2;// correct
type test_3 = Regex_1 | NonRegex;// correct
type test_4 = Regex_1 & NonRegex;// compile time error
if (test_1.test(something)) {
    something;// type is test_1
    // something matches Regex_1 OR Regex_2
}
if (test_2.test(something)) {
    something;// type is test_2
    // something matches Regex_1 AND Regex_2
}
if (test_3.test(something)) {
    something;// type is Regex_1
} else {
    something;// type is NonRegex
}

Generics

There are no special cases for generics, so regex-validated type could be used with generics in same way as usual types.
For generics with constraints like below, regex-validated type behaves like string:

class Something<T extends String> { ... }
let something = new Something<Email>();// correct

Emit overview

Unlike usual types, regex-validated have some impact on emit:

type Regex_1 = / ... /;
type Regex_2 = / ... /;
type NonRegex = { ... };
type test_1 = Regex_1 | Regex_2;
type test_2 = Regex_1 & Regex_2;
type test_3 = Regex_1 | NonRegex;
type test_4 = Regex_1 & NonRegex;
if (test_1.test(something)) {
    /* ... */
}
if (test_2.test(something)) {
    /* ... */
}
if (test_3.test(something)) {
    /* ... */
} else {
    /* ... */
}

will compile to:

var Regex_1 = / ... /;
var Regex_2 = / ... /;
if (Regex_1.test(something) || Regex_2.test(something)) {
    /* ... */
}
if (Regex_1.test(something) && Regex_2.test(something)) {
    /* ... */
}
if (Regex_1.test(something)) {
    /* ... */
} else {
    /* ... */
}

Compatibility overview

This feature has no issues with compatibility, because there only case that could break it and it is related to that regex-validated type has emit impact unlike usual type, so this is valid TS code:

type someType = { ... };
var someType = { ... };

when code below is not:

type someRegex = / ... /;
var someRegex = { ... };

But second already WAS invalid, but due to another reason (type declaration was wrong).
So now we have to restrict declaring of variable with name same to type, in case when this type is regex-validated.

P.S.

Feel free to point on things that I probably have missed. If you like this proposal, I could try to create tests that covers it and add them as PR.

Igmat added a commit to Igmat/TypeScript that referenced this issue Nov 18, 2016
Igmat added a commit to Igmat/TypeScript that referenced this issue Nov 21, 2016
Igmat added a commit to Igmat/TypeScript that referenced this issue Nov 21, 2016
@Igmat
Copy link

@Igmat Igmat commented Nov 21, 2016

I've forgotten to point to some cases for intersections and unions of regex-validated types, but I've described them in latest test case. Should I update Design proposal to reflect that minor change?

Igmat added a commit to Igmat/TypeScript that referenced this issue Nov 22, 2016
@alexanderbird
Copy link

@alexanderbird alexanderbird commented Nov 29, 2016

@Igmat, question about your design proposal: Could you elaborate on the emit overview? Why would regex-validated types need to be emitted? As far as I can tell, other types don't support runtime checks... am I missing something?

@Igmat
Copy link

@Igmat Igmat commented Nov 30, 2016

@alexanderbird, yes, any other type have no impact on emit. At first, I thought that regex-validated will do so as well, so I've started creating the proposal and playing with proposed syntax.
First approach was like this:

let fontColor: /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;
fontColor = "#000";

and this:

type CssColor: /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;
let fontColor: CssColor;
fontColor = "#000";

It's ok and has no need for emit changes, because "#000" could be checked in compile time.
But we also have to handle narrowing from string to regex-validated type in order to make it useful. So I've thought about this for both previous setups:

let someString: string;
if (/^#([0-9a-f]{3}|[0-9a-f]{6})$/i.test(someString)) {
    fontColor = someString; // Ok
}
fontColor = someString; // compile time error

So it also has no impact on emit and looks ok, except that regex isn't very readable and have to be copied in all places, so user could easily make a mistake. But in this particular case it still seems to be better than changing how type works.
But then I realized that this stuff:

let someString: string;
let email: /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/I;
if (/^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i.test(someString)) {
    email = someString; // Ok
}
email = someString; // compile time error

is a nightmare. And it's even without intersections and unions. So to avoid happening of stuff like this, we have to slightly change type emit as shown in proposal.

@Igmat
Copy link

@Igmat Igmat commented Nov 30, 2016

@DanielRosenwasser, could you, please, provide some feedback for this proposal? And also for tests referenced here, if possible?
I really want to help with implementing of this feature, but it requires a lot of time (tsc is really complicated project and I still have to work on understanding of how it works inside) and I don't know is this proposal is ready to implement or you will reject this feature implemented in this way due to another language design vision or any other reason.

@DanielRosenwasser
Copy link
Member

@DanielRosenwasser DanielRosenwasser commented Nov 30, 2016

Hey @Igmat, I think there are a few things I should have initially asked about

To start, I still don't understand why you need any sort of change to emit, and I don't think any sort of emit based on types would be acceptable. Check out our non-goals here.

Another issue I should have brought up is the problem of regular expressions that use backreferences. My understanding (and experience) is that backreferences in a regular expression can force a test to run in time exponential to its input. Is this a corner case? Perhaps, but it's something I'd prefer to avoid in general. This is especially important given that in editor scenarios, a type-check at a location should take a minimal amount of time.

Another issue is that we'd need to either rely on the engine that the TypeScript compiler runs on, or build a custom regular expression engine to execute these things. For instance, TC39 is moving to include a new s flag so that . can match newlines. There would be a discrepancy between ESXXXX and older runtimes that support this.

@alexanderbird
Copy link

@alexanderbird alexanderbird commented Nov 30, 2016

@Igmat - there is no question in my mind that having regexes emitted at runtime would be useful. However, I don't think they're necessary for this feature to be useful (and from the sounds of what @DanielRosenwasser has said, it probably wouldn't get approved anyway). You said

But we also have to handle narrowing from string to regex-validated type in order to make it useful

I think this is only the case if we are to narrow from a dynamic string to a regex-validated type. This gets very complicated. Even in this simple case:

function foo(bar: number) {
    let baz: /prefix:\d+/ = 'prefix:' + number;
}

We can't be sure that the types will match - what if the number is negative? And as the regexes get more complicated, it just gets messier and messier. If we really wanted this, maybe we allow "type interpolation: type Baz = /prefix:{number}/... but I don't know if it's worth going there.

Instead, we could get partway to the goal if we only allowed string literals to be assigned to regex-validated types.

Consider the following:

type Color = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;
let foo: Color = '#000000';
let bar: Color = '#0000'; // Error - string literal '#0000' is not assignable to type 'Color'; '#0000' does not match /^#([0-9a-f]{3}|[0-9a-f]{6})$/i
let baz: Color = '#' + config.userColorChoice; // Error - type 'string' is not assignable to type 'regex-validated-string'

Do you think that's a workable alternative?

@Igmat
Copy link

@Igmat Igmat commented Dec 1, 2016

@DanielRosenwasser, I've read Design Goals carefully and, if I understand you correctly, problem is violation of Non-goals#5.
But it doesn't seem to me as violation, but as syntax improvement. For example, previously we had:

const emailRegex = /.../;
/**
 * assign it only with values tested to emailRegex 
 */
let email: string;
let userInput: string;
// somehow get user input
if (emailRegex.test(userInput)) {
    email = userInput;
} else {
    console.log('User provided invalid email. Showing validation error');
    // Some code for validation error
}

With this proposal implemented it would look like:

type Email = /.../;
let email: Email;
let userInput: string;
// somehow get user input
if (Email.test(userInput)) {
    email = userInput;
} else {
    console.log('User provided invalid email. Showing validation error');
    // Some code for validation error
}

As you see, code is almost the same - it's a common simple usage of regex. But second case is much more expressive and will prevent user from accidental mistake, like forgetting to check string before assignment it to variable that meant to be regex-validated.
Second thing is that without such type narrowing we won't be able to normally use regex-validated type in indexes, because in most cases such index fields works with some variable that can't be checked in runtime as it could be done with literals.

@Igmat
Copy link

@Igmat Igmat commented Dec 1, 2016

@alexanderbird, I don't suggest making this code valid or add some hidden checks in both runtime and compile time.

function foo(bar: number) {
    let baz: /prefix:\d+/ = 'prefix:' + number;
}

This code have to throw error due to my proposal. But this:

function foo(bar: number) {
    let baz: /prefix:\d+/ = ('prefix:' + number) as /prefix:\d+/;
}

or this:

function foo(bar: number) {
    let baz: /prefix:\d+/;
    let possibleBaz: string = 'prefix:' + number;
    if (/prefix:\d+/.test(possibleBaz)) {
        baz = possibleBaz;
    }
}

would be correct, and even have no impact to emitted code.

And as I showed in previous comment, literals would be definitely not enough even for common use cases, because we often have to work with stings from user input or other sources. Without implementing of this emit impact, users would have to work with this type in next way:

export type Email = /.../;
export const Email = /.../;
let email: Email;
let userInput: string;
// somehow get user input
if (Email.test(userInput)) {
    email = <Email>userInput;
} else {
    console.log('User provided invalid email. Showing validation error');
    // Some code for validation error
}

or for intersections:

export type Email = /email-regex/;
export const Email = /email-regex/;
export type Gmail = Email & /gmail-regex/;
export const Gmail = {
    test: (input: string) => Email.test(input) && /gmail-regex/.test(input)
};
let gmail: Gmail;
let userInput: string;
// somehow get user input
if (Gmail.test(userInput)) {
    gmail = <Gmail>userInput;
} else {
    console.log('User provided invalid gmail. Showing validation error');
    // Some code for validation error
}

I don't think that forcing users to duplicate code and to use explicit cast, when it could be easily handled by compiler isn't a good way to go. Emit impact is really very small and predictable, I'm sure that it won't surprise users or lead to some feature misunderstood or hard to locate bugs, while implementing this feature without emit changes definitely WILL.

In conclusion I want to say that in simple terms regex-validated type is both a scoped variable and a compiler type.

@Igmat
Copy link

@Igmat Igmat commented Dec 1, 2016

@DanielRosenwasser and @alexanderbird ok, I have one more idea for that. What about syntax like this:

const type Email = /email-regex/;

In this case user have to explicitly define that he/she want this as both type and const, so actual type system has no emit changes unless it used with such modifier. But if it used with it we are still able to avoid a lot of mistakes, casts and duplication of code by adding same emit as for:

const Email = /email-regex/;

This seems to be even bigger than just improvement for this proposal, because this probably could allow something like this (example is from project with Redux):

export type SOME_ACTION = 'SOME_ACTION';
export const SOME_ACTION = 'SOME_ACTION' as SOME_ACTION;

being converted to

export const type SOME_ACTION = 'SOME_ACTION';

I've tried to found some similar suggestion but wasn't successful. If it could be a workaround and if you like such idea, I can prepare Design Proposal and tests for it.

@Igmat
Copy link

@Igmat Igmat commented Dec 1, 2016

@DanielRosenwasser, about your second issue - I don't think that it would ever happen, because in my suggestion compiler runs regex only for literals and it doesn't seems that someone will do something like this:

let something: /some-regex-with-backreferences/ = `
long enough string to make regex.test significantly affect performance
`

Anyway we could test how long literal should be for affecting real-time performance and create some heuristic that will warn user if we are unable to check it while he faces this circumstances in some editor scenarios, but we would check it when he will compile the project. Or there could be some other workarounds.

About third question, I'm not sure that understand everything correctly, but it seems that regex engine should be selected depending on target from tsconfig if they have different implementations. Needs some more investigation.

@Igmat
Copy link

@Igmat Igmat commented Dec 6, 2016

@DanielRosenwasser are there any thoughts? 😄 About initial proposal and about last one. May be I have to make more detailed overview of second one, do I?

@rozzzly
Copy link

@rozzzly rozzzly commented Jul 31, 2020

What about a non-regex syntax like this:

type TLD = 'com' | 'net' | 'org';
type Domain = `${string}.${TLD}`;
type URL = `${'http'|'https'}://${Domain}`;

const good: URL = 'https://google.com'; // ✔️
const bad: URL = 'ftp://example.com'; // ✖️ TypeError: 'ftp' is not assignable to type 'http' | 'https'

In my mind, this would fit quite nicely in the type system. You could add different syntax for things like optional matches:

type SubDomain = `${string}.`;
type Domain = `${SubDomain}?${string}.${TLD}`;

Add support for quantifiers, greedy operator and you've got something quite robust I would think that would probably be enough for a majority of the use cases developers might want to use this for.

@nikeee
Copy link
Contributor

@nikeee nikeee commented Jul 31, 2020

I think this approach would be more user-friendly. However, it seems to be aquvalent to arithmetic operations on types.
According to #15645 (comment) and #15794 (comment), it is a design decision to not do arithmetics on types.
If I'm not mistaking, this approach can easily create an enormous type. Consider:

type TLD = 'com' | 'net' | 'org' | 'ly' | 'a' | 'b' | 'c' | 'd';
type Foo = `${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}`;
type Bar = `${Foo}${Foo}${Foo}${Foo}${Foo}`

(this assumes the implementation would use union types. It may work with a different / more complex implementation)

Dislaimer: I am not part of the TS team and I am not working on TS. Just my 2c.

@isiahmeadows
Copy link

@isiahmeadows isiahmeadows commented Aug 1, 2020

@rozzzly @nikeee That's more or less the essence of my proposal, just with a few smaller features missing. I based mine on a large subset of regular languages (the formal language concept), not regular expressions in the sense of regexp literals and such, so it's much less powerful than those but powerful enough to get the job done.

I think this approach would be more user-friendly. However, it seems to be aquvalent to arithmetic operations on types.

Math says that validating whether a type is a subtype of another is computationally equivalent to checking if a string is contained within a given formal language.

Domain validation specifically is actually a pretty complicated thing to do if you also check for TLD/public suffix validity. Generic domains themselves per the RFC are as simple as /[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)+/ + being at most 255 characters, but even this is very complicated to type unless you go for full regular grammars as the above regexp demonstrates. You could programmatically generate the type pretty straightforwardly (I'll leave it as an exercise for the reader) using only strings from @rozzzly's or my proposal, but the end result is still rather complicated.

@rozzzly
Copy link

@rozzzly rozzzly commented Aug 2, 2020

@isiahmeadows

That's more or less the essence of my proposal, just with a few smaller features missing.

The last time I read through this entire thread was well over a year ago. I was on my break and saw a notification, read @rugk's comment about "well …don't call it "regex" – for a starter" which got me thinking... I hadn't realized someone had already pitched a considerably more detailed proposal for essentially the same (/a very similar) idea.

...even this is very complicated to type unless you go for full regular grammars as the above regexp demonstrates. You could programmatically generate the type pretty straightforwardly (I'll leave it as an exercise for the reader) using only strings from @rozzzly's or my proposal, but the end result is still rather complicated.

In my mind, some facility for limited pattern matching such I suggested allow would be extremely useful for very simplistic, and necessarily not-rigorously strict typing. The example I gave is far from precise, and wouldn't blow up the compiler.

But as @nikeee and you both point out this could be taken to dangerous extremes. Assuming a most naive implementation supporting only unions. Somebody is going to ruin everyone's day publishing an update to @types/some-popular-project that contained:

type MixedCaseAlphaNumeric = (
	| 'a'
	| 'b'
	| 'c'
	// and so on
);

type StrWithLengthBeteen1And64<Charset extends string> = (
	| `${Charset}`
	| `${Charset}|${Charset}`
	| `${Charset}|${Charset}|${Charset}`
	// and so on
);

function updatePassword(userID: number, password: StrWithLengthBetween1And64<MixedCaseAlphaNumeric>): void {
	// ...
}

Putting that into perspective, that union consists of distinct types which is more than atoms in the observable universe.

Now, I've seen some dreadfully long assignability errors but imagine (the untruncated) error for that....

Type '"😢"' is not assignable to type '"a"|"b"|"c"..........."'.ts(2322)'

So yeah.. there are some issues there

@simonbuchan
Copy link

@simonbuchan simonbuchan commented Aug 2, 2020

@rozzzly What makes that type any different (in terms of feasibility) than a TupleWithLengthBeteen1And64<Charset>?
The complier isn't forced to expand every type to a normalized form, it would quickly explode on fairly normal types if it did.
Not saying I think this issue makes sense in typescript at the moment, if even "integer between 3 and 1024" (think message buffer allocation lengths) is considered out of scope.

@isiahmeadows
Copy link

@isiahmeadows isiahmeadows commented Aug 2, 2020

@simonbuchan At least prefix and suffix types need to exist, if nothing else. That is itself required for many DOM libraries and frameworks.

@AnyhowStep
Copy link
Contributor

@AnyhowStep AnyhowStep commented Sep 29, 2020

I know this has been beaten to death and some good proposals have been given already. But I just wanted to add extra stuff that some might find mildly interesting.

In the face of backreferences, I suspect it still is decidable, but likely exponential if not worse. Just an educated guess, though.

Back references can make a regexp describe a context-sensitive grammar, a superset of context-free grammars. And language equality for CFGs is undecidable. So it's even worse for CSGs, which are equivalent to linear-bounded automatons.


Assuming just all the regular expressions that can be converted to a DFA are used in a regexp (concat, union, star, intersection, complement, etc.), converting a regexp to an NFA is O(n), getting the product of two NFAs is O(m*n), then traversing the resulting graph for accept states is O(m*n). So, checking the language equality/subset of two regular regexps is also O(m*n).

The problem is that the alphabet is really large here. Textbooks restrict themselves to alphabets of size 1-5 usually, when talking about DFAs/NFAs/regular expressions. But with JS regexps, we have all of unicode as our alphabet. Granted, there can be efficient ways of representing transition functions using sparse arrays and other clever hacks and optimizations for equality/subset testing...

I'm confident it's possible to do type checking for regular-to-regular assignment somewhat efficiently.

Then, all non-regular assignments can just require explicit type assertions.

I've recently worked on a small finite automaton project, so the info is still fresh in my mind =x

@nikeee
Copy link
Contributor

@nikeee nikeee commented Sep 29, 2020

If I'm not mistaking, this approach can easily create an enormous type. Consider:

type TLD = 'com' | 'net' | 'org' | 'ly' | 'a' | 'b' | 'c' | 'd';
type Foo = `${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}${TLD}`;
type Bar = `${Foo}${Foo}${Foo}${Foo}${Foo}`

(this assumes the implementation would use union types. It may work with a different / more complex implementation)

Funnily enough, this is exactly what's possible with the new template string literal types. This case is avoided by having a threshold for union types, it seems.

@isiahmeadows
Copy link

@isiahmeadows isiahmeadows commented Sep 30, 2020

@AnyhowStep JS backreferences are the only context-sensitive production (and a fairly simple and limited one at that - only up to 9 groups can be referenced like that), and the rest of the regexp grammar is regular, so that's why I suspect it is decidable. But regardless, I think we can agree it's not practical in any sense of the word. 🙂

Edit: accuracy

@styfle
Copy link
Contributor

@styfle styfle commented Oct 1, 2020

I confirmed this comment from @rozzzly works with TS 4.1.0 nightly!

type TLD = 'com' | 'net' | 'org';
type Domain = `${string}.${TLD}`;
type Url = `${'http'|'https'}://${Domain}`;

const success: Url = 'https://example.com';
const fail: Url = 'example.com';
const domain: Domain = 'example.com';

Try it in the playground and see that fail has a compile time error 🤩


Update: after playing with this feature a bit, it will not cover many use cases. For example, it doesn't work for a hex color string.

type HexChar = '0' | '1' | '2' | '3' | '4' | '5' | '6'| '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';
type HexColor = `#${HexChar}${HexChar}${HexChar}${HexChar}${HexChar}${HexChar}`;
let color: HexColor = '#123456';

Today, that fails with "Expression produces a union type that is too complex to represent.(2590)"

@brandonthomas
Copy link

@brandonthomas brandonthomas commented Oct 1, 2020

I confirmed this comment from @rozzzly works with TS 4.1.0 nightly!

type TLD = 'com' | 'net' | 'org';
type Domain = `${string}.${TLD}`;
type Url = `${'http'|'https'}://${Domain}`;

const success: Url = 'https://example.com';
const fail: Url = 'example.com';
const domain: Domain = 'example.com';

Try it in the playground and see that fail has a compile time error 🤩

This would solve the data- or aria- problem that most of us face in UX libraries if it can be applied to indexes.

@brandonthomas
Copy link

@brandonthomas brandonthomas commented Oct 1, 2020

@chadlavi-casebook
Copy link

@chadlavi-casebook chadlavi-casebook commented Oct 1, 2020

Update: after playing with this feature a bit, it will not cover many use cases. For example, it doesn't work for a hex color string.

type HexChar = '0' | '1' | '2' | '3' | '4' | '5' | '6'| '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';
type HexColor = `#${HexChar}${HexChar}${HexChar}${HexChar}${HexChar}${HexChar}`;
let color: HexColor = '#123456';

Today, that fails with "Expression produces a union type that is too complex to represent.(2590)"

There was some reference to this limitation in the release notes. It creates a list of all the possible valid combinations, in this case it would create a union with 16,777,216 (i.e., 16^6) members.

@thehappycheese
Copy link

@thehappycheese thehappycheese commented Oct 6, 2020

This is a great idea... Igmat made some incredible posts back in 2016 that look good on paper anyway.

I found this because I wanted to make sure the keys of an object literal passed into my function were valid css class names. I can easily check at runtime... but to me it seems so obvious that typescript should be able to do this at compile time, especially in situations where I am just hard-coding object literals and typescript shouldn't have to figure out if MyUnionExtendedExotictype satisfies SomeArbitraryRegexType.

Maybe one day I will be knowledgeable enough to make a more productive contribution :/

@rozzzly
Copy link

@rozzzly rozzzly commented Oct 14, 2020

I confirmed this comment from @rozzzly works with TS 4.1.0 nightly!

Wow. I honestly did not expect to see this get implemented, not anytime soon at least.

@chadlavi-casebook

There was some reference to this limitation in the release notes. It creates a list of all the possible valid combinations, in this case it would create a union with 16,777,216 (i.e., 16^6) members.

I'd be curious to see how large that union could get before it became a problem performance wise. @styfle's example shows how easy it is to hit that ceiling. There's obviously going to be a some degree of diminishing returns of usefulness of complex types vs performance.

@thehappycheese

I wanted to make sure the keys of an object literal passed into my function were valid css class names

I'm fairly confident in saying that it's not possible with the current implementation. If there was support for quantifiers and ranges you would probably get validation for BEM style class names. The standard js regex for that isn't too terrible:
^\.[a-z]([a-z0-9-]+)?(__([a-z0-9]+-?)+)?(--([a-z0-9]+-?)+){0,2}$
You would also ditch the anchors because as the implementation stands, it's either an end-to-end match or nothing so ^ and $ are implied. Now that's a comparatively simple regex for a narrow subset of what is a valid css selector. For example: ಠ_ಠ is a valid class name. I'm not kidding. CSS selectors are very permissive swhich makes them extremely difficult to validate. So your desire is probably out of scope for template literal types, at least for the foreseeable future. 😞

@AnyhowStep
Copy link
Contributor

@AnyhowStep AnyhowStep commented Oct 17, 2020

I'm sorry. I had to do this.

I implemented regular languages in TypeScript.

More accurately, I implemented a simple deterministic finite automaton using TS 4.1

I mean, we can already implement Turing machines in TS. So, DFAs and PDAs are "easy", compared to that.

And template strings make this more usable.


The core types are actually simple and fit in < 30 LOC,

type Head<StrT extends string> = StrT extends `${infer HeadT}${string}` ? HeadT : never;

type Tail<StrT extends string> = StrT extends `${string}${infer TailT}` ? TailT : never;

interface Dfa {
    startState : string,
    acceptStates : string,
    transitions : Record<string, Record<string, string>>,
}

type AcceptsImpl<
    DfaT extends Dfa,
    StateT extends string,
    InputT extends string
> =
    InputT extends "" ?
    (StateT extends DfaT["acceptStates"] ? true : false) :
    AcceptsImpl<
        DfaT,
        DfaT["transitions"][StateT][Head<InputT>],
        Tail<InputT>
    >;

type Accepts<DfaT extends Dfa, InputT extends string> = AcceptsImpl<DfaT, DfaT["startState"], InputT>;

It's specifying the automatons that's the hard part.

But I'm pretty sure someone can make a regex to TypeScript DFA™ generator...


I'd also like to highlight that the "hex string of length 6" example shows you can make function parameters only accept strings matching the regex using ugly hackery,

declare function takesOnlyHex<StrT extends string> (
    hexString : Accepts<HexStringLen6, StrT> extends true ? StrT : {__err : `${StrT} is not a hex-string of length 6`}
) : void;

//OK
takesOnlyHex("DEADBE")

//Error: Argument of type 'string' is not assignable to parameter of type '{ __err: "DEADBEEF is not a hex-string of length 6"; }'.
takesOnlyHex("DEADBEEF")

//OK
takesOnlyHex("01A34B")

//Error: Argument of type 'string' is not assignable to parameter of type '{ __err: "01AZ4B is not a hex-string of length 6"; }'.
takesOnlyHex("01AZ4B")

Here's a bonus Playground; it implements the regex /^hello .*/

And another Playground; it implements the regex / world$/

One final example, Playground; this is a floating point string regex!

@Harpush
Copy link

@Harpush Harpush commented Oct 17, 2020

@AnyhowStep Well i used your DFA idea to implement a simple regex [abc]{4} which means the letters abc in any order with missing but exactly the length of 4. (aaaa, abcc, bbcc, etc...).
Playground

@AnyhowStep
Copy link
Contributor

@AnyhowStep AnyhowStep commented Oct 17, 2020

https://cyberzhg.github.io/toolbox/min_dfa?regex=ZCgoYmQqYiopKmMpKg==

https://github.com/CyberZHG/toolbox

If I had more willpower, I'd grab something like the above and use it to turn regexes into TS DFAs™ lol

@AnyhowStep
Copy link
Contributor

@AnyhowStep AnyhowStep commented Oct 17, 2020

Okay, I just threw together a prototype,

https://glitch.com/~sassy-valiant-heath

[Edit] https://glitch.com/~efficacious-valley-repair <-- This produces way better output for more complicated regexes

[Edit] It seems like Glitch will archive free projects that are inactive for too long. So, here's a git repo with the files,
https://github.com/AnyhowStep/efficacious-valley-repair/tree/main/app

Step 1, key in your regex here,
image

Step 2, click convert,
image

Step 3, click the generated TS playground URL,
image

Step 4, scroll down till InLanguage_0,
image

Step 5, play with input values,
image

image

Shoutout to @kpdyer , author of https://www.npmjs.com/package/regex2dfa , for doing the heavy lifting of the conversion

@iTob191
Copy link

@iTob191 iTob191 commented Oct 18, 2020

In case someone needs something a little more powerful, here's a Turing machine 😆

Playground

@RyanCavanaugh
Copy link
Member

@RyanCavanaugh RyanCavanaugh commented Oct 19, 2020

This thread has gotten too long to read and many of the comments are either addressed by template literal types or are off-topic. I've created a new issue #41160 for discussion of what remaining use cases might be enabled by this feature. Feel free to continue discussing type system parsers here 😀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

You can’t perform that action at this time.