Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for generating email addresses #162

Closed
DRMacIver opened this issue Sep 25, 2015 · 34 comments · Fixed by #1320
Closed

Support for generating email addresses #162

DRMacIver opened this issue Sep 25, 2015 · 34 comments · Fixed by #1320
Labels
new-feature entirely novel capabilities or strategies

Comments

@DRMacIver
Copy link
Member

DRMacIver commented Sep 25, 2015

The django integration currently relies on the simple email strategy in provisional.py. It would be nice to have a native email address type which was a bit better at hitting edge cases.

When this issue was first opened, we relied on the fake-factory package for Django email fields, which was substantially slower and less devious than Hypothesis strategies.
That has been fixed, but we'd still like to improve the email strategy before making it part of the public API.

Anyone working on this issue should start with the provisional strategy, move it to strategies.py, and gradually expand it to generate unusual things allowed by the relevant RFCs (see below). Very obscure features - such as >255 character addresses, or using an IP address instead of a domain name - should be avoided, or at least gated behind an optional argument (eg emails(allow_obscure=False)).

@DRMacIver DRMacIver added enhancement it's not broken, but we want it to be better help-wanted labels Sep 25, 2015
@waynew
Copy link

waynew commented Oct 8, 2015

Should these just generate valid email addresses or invalid ones, too? I know I've recently read that (supposedly) you can send an email to someone@gmail.com or someone+something.else@gmail.com and the +... part gets ignored but still delivered.

@The-Compiler
Copy link
Contributor

Yeah, that works with GMail and some other setups as well.

There are much more fun addresses which are valid, though. From this blog article those are all valid according to the RFCs:

  • "Abc\@def"@example.com
  • "Fred Bloggs"@example.com
  • "Joe\\Blow"@example.com
  • "Abc@def"@example.com
  • customer/department=shipping@example.com
  • \$A12345@example.com
  • !def!xyz%abc@example.com
  • _somename@example.com

@kxepal
Copy link
Member

kxepal commented Oct 8, 2015

I think email strategy should be flexible enough to let users provide custom strategies for local and domain parts. Then you can construct strategies that fulfils specific rules per email provider: gmail strategy, hotmail strategy etc.

@DRMacIver
Copy link
Member Author

There's sort of a sliding scale of how much work this feature is from "Something pretty simple" to "Possibly as much work as the rest of Hypothesis's strategies combined". ;-)

I suspect the former is more realistic as a starting point. I agree with @kxepal that it's important to be able to plug in custom strategies, and think it might make sense for Hypothesis to have a bunch of common provider strategies out of the box.

@maxnordlund
Copy link
Contributor

I like to take a stab at this. I need it for work anyways, and why not share? To adhere to the KISS principle I think a first pass would be to generate "normal" email addresses. So just something matching the regular expression [A-Za-z0-9][-+._A-Za-z0-9]*@([A-Za-z0-9]+\.)+[A-Za-z]{2,}. But slightly more complicated to properly handle unicode, use a list for top level domains etc.

For something closer to the actual standard see Djangos validator. But for the vast majority I think this simplified version should be good enough™

Sounds good? Or is this already in and I completely missed it?

@DRMacIver
Copy link
Member Author

It's semi already in - there's currently a purely internal version hidden inside the Django module which is good enough to use but I felt needed more work/thought to make public. If you wanted to do that work it would be very gratefully received!

@maxnordlund
Copy link
Contributor

I assume you mean https://github.com/HypothesisWorks/hypothesis-python/blob/master/src/hypothesis/extra/django/models.py#L98-L122?

I think it might be a bit bigger then that one, should it live directly under extra then?

@DRMacIver
Copy link
Member Author

I think it might be a bit bigger then that one

I agree. That's why it isn't ready to be made public without more work!

The goal of that was to remove the dependency on faker's email address support (which was causing a major performance problem for Django users), so it only had to be not worse than that to be an improvement.

should it live directly under extra then?

No, it's fine for it to go in hypothesis.strategies. extra is for packages that have external dependencies.

@maxnordlund
Copy link
Contributor

So after a bit more work I've added a constant with all current top-level domain names. But it's kinda big, 1561, and I don't like it living in the class, but how/where do I put it instead?

@DRMacIver
Copy link
Member Author

Most large strategies live in their own files in hypothesis/strategies. If you put it in its own file it's fine for there to be a large top level constant in the module.

I thought we had an example of a non-python data file somewhere in the project, but I can't currently find it if so.

@maxnordlund
Copy link
Contributor

You mean the unicode pickle thing? It's autogenerated from the unicode module, but this has to be checked in. I'll make it its on file then, and have the constant be in there.

@DRMacIver
Copy link
Member Author

Yeah, I thought there was something else related to that, but I think in the end either I'm imagining things or it's something that went away.

@krzysztof-sikorski
Copy link

I may be completely confused, but would it not be better to use public suffix list instead of hard-coded list?

@maxnordlund
Copy link
Contributor

Neat, I didn't know about that one. I'll definitely take a look. I could automatically download the list the first time, but that means internet access, or use the https://github.com/publicsuffix/list and package it for pypi. I think the latter is nicer, but I'll need some help with that.

@jwg4
Copy link
Contributor

jwg4 commented Mar 24, 2017

I don't understand why not having a hard-coded list is important enough to merit either making an HTTP request at run time, or adding a dependency to another package? The purpose of the public suffix list seems to be something else altogether, and it's not obvious why it should be used here. The need here is just for a large, representative set of sample domains.

@maxnordlund
Copy link
Contributor

If I'm understanding it correctly, it doesn't just contain all top-level domains, but also information about which second-level ones are accepted. For instance co.uk and not just *.uk. It also contains a bunch more about cookies, but that can be easily ignored.

It will still be a hardcoded list, just not maintained by us. And if we ship it together with the rest of the code, it won't add any external dependencies for the end user. What's neat about it is that as stuff changes, those will be picked up by that list, and in turn by hypothesis, instead of manually going in on https://www.iana.org/domains/root/db and run JSON.stringify($$(".tld").map((el) => el.textContent.slice(1))) in the console.

@DRMacIver
Copy link
Member Author

I'm pretty strongly against having Hypothesis make HTTP requests. It would definitely be good to automate the process of updating any lists from external data though.

@maxnordlund
Copy link
Contributor

maxnordlund commented Mar 24, 2017

Okay ​so a status update. I've got most of code in place, but it's just a sketch for now. I tried to get away with just using MappedStrategy and composition, but there are these size and dot/period constraints that make it difficult.

I think I have to take the plunge and use do_draw directly. When I get home from work I'll push what I have and you can take a look.

@DRMacIver
Copy link
Member Author

I think I have to take the plunge and use do_draw directly.

Feel free to just use composite if that's easier for you - there's not very much difference between the two for your purposes.

@maxnordlund
Copy link
Contributor

maxnordlund commented Mar 28, 2017

Unfortunately that's not quite the case, I need to either filter out a lot of examples, or build it up by hand since I have so many size limitations. For domains it's not just 1-127 subdomains with 1-63 characters, but also that the total length has to be below 253 characters long.

Similarly for emails there are some funky rules regarding dots (and quoted text but I'll just ignore that for now). There are two size restrictions, the local part, aka before the @, must be at most 64 characters, and the entire address 254.

However I think it's not to hard to generate this by just slowly building up the various parts until you hit the size limit(s) and then return. It might be possible with some clever use of flatMap, but I'll have to investigate that further.

Something like:

def do_draw():
  min_size = random.randrange(1, 127)
  max_size random.randrange(min_size, 127)

  top_domain = choose(LIST_OF_SUFFIXES)
  labels = [] # subdomains in RFC lingo
  total = len(top_domain) + 1

  while True:
    label = draw(text(min_size=1, max_size=max_size, alphabet=valid_characters))
    total += len(label) + 1
    if total > max_size:
      break
    labels.append(label)

  labels.append(top_domain)
  return ".".join(labels)
def do_draw():
  domain = draw(domain)

  min_size = random.randint(1, 64)
  max_size random.randint(min_size, 64)

  parts = []
  total = 0
  
  while True:
    part = draw(text(min_size=1, max_size=max_size, alphabet=valid_characters))
    if part[0] == ".":
      part[0] = part[1:]
    if parts[-1][-1] != "." and random.random() < 0.5: # or whatever
      part += "."

    total += len(part)
    if total > max_size or total + 1 + len(domain) > 254:
      break
    parts.append(part)

  if parts[-1][-1] == ".":
    parts[-1] = parts[-1][:-2]

  assume(len(parts) > min_size)

  return "%s@%s" % ("".join(parts), domain)

@maximkulkin
Copy link
Contributor

Guys, FYI I created an extension to Hypothesis to generate data based on regular expression - hypothesis-regex. Generating email is a perfect use case for that library. Give it a try.

I'm still working on it to address some rough edges and once done it will hopefully be merged in Hypothesis.

@maxnordlund
Copy link
Contributor

Lovely, my giant project at work is finally over (pending review) so I can get back to this. I think a combination of my domain generator plus your regex would be perfect. Just gotta tinker with it 😉

@Zac-HD
Copy link
Member

Zac-HD commented Jun 1, 2017

Looking forward to it 😄

Unfortunately regex is not a good fit for generating the really "fun" email addresses that will break people's code, like ~."@".\ @example.com - see RFC3693s3 and errata for more - which we would like to do.

@maximkulkin
Copy link
Contributor

@Zac-HD It's all up to how creative you can be with your email regex. The one that I have in hypothesis-regex's README is just simplified version for demo purposes.

@Zac-HD
Copy link
Member

Zac-HD commented Jun 1, 2017

Oh, it's certainly possible! Probably easier to get the right minimising behaviour by writing a strategy than a regex though, and likely to be more efficient. 😜

@maximkulkin
Copy link
Contributor

maximkulkin commented Jun 1, 2017

Well, IMO hand written strategy will mimic the same structure as regex-based strategy, but regex provide a more compact syntax (DSL) to define them.

@maxnordlund
Copy link
Contributor

Yeah, I figured as much but it doesn't hurt to try. That should also iron out the bugs and whatnot for the regex package which I consider a good thing ™️

@Zac-HD Zac-HD added new-feature entirely novel capabilities or strategies and removed enhancement it's not broken, but we want it to be better labels Jun 10, 2017
@Zac-HD Zac-HD added this to the 3.x milestone Mar 31, 2018
@Zac-HD
Copy link
Member

Zac-HD commented Mar 31, 2018

I've added this to the 3.x milestone, because with the Faker integration going away many users will be looking for ways to generate email addresses.

Literally just moving the provisional emails() strategy to the public API would be sufficient to meet this need.

@bukzor
Copy link
Contributor

bukzor commented Apr 6, 2018

Would you mind a optional dependency on yelp_uri to implement this? It would be in the same spirit as the optional dependency on pytz. This would allow for a strategy that generates the full range of acceptable emails.

@Zac-HD
Copy link
Member

Zac-HD commented Apr 7, 2018

My strong preference would be to implement it with just the standard library - the provisional strategy already works, and there's a pretty clear upgrade path to improve each part of it.

@Zac-HD
Copy link
Member

Zac-HD commented Jun 8, 2018

This issue is reserved for @sushobhit27 until the end of June 2018.

Moving the provisional emails() strategy to the public API would be sufficient to meet this need.

That means moving it from provisional.py to strategies.py, along with a changelog, documentation, a changelog (see guides/documentation.rst), adding yourself to the authors list in CONTRIBUTING.rst, and moving the tests. Please don't hesitate to ask for help if you get stuck!

@sushobhit27
Copy link
Contributor

@Zac-HD shall I move just emails strategy or also move all strategies (like ip4_addr_strings, domains etc) from the file.

@Zac-HD
Copy link
Member

Zac-HD commented Jun 9, 2018

Just emails - the others are not public API.

@sushobhit27
Copy link
Contributor

@Zac-HD Thanks a ton for walking me through all the small steps for this PR, and a big thanks to all contributors of hypothesis, its really a great library which although I started using recently but found to be of immense help, while writing test cases. Thanks!! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new-feature entirely novel capabilities or strategies
Projects
None yet
Development

Successfully merging a pull request may close this issue.