-
Notifications
You must be signed in to change notification settings - Fork 586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for generating email addresses #162
Comments
|
Should these just generate valid email addresses or invalid ones, too? I know I've recently read that (supposedly) you can send an email to |
|
Yeah, that works with GMail and some other setups as well. There are much more fun addresses which are valid, though. From this blog article those are all valid according to the RFCs:
|
|
I think email strategy should be flexible enough to let users provide custom strategies for local and domain parts. Then you can construct strategies that fulfils specific rules per email provider: gmail strategy, hotmail strategy etc. |
|
There's sort of a sliding scale of how much work this feature is from "Something pretty simple" to "Possibly as much work as the rest of Hypothesis's strategies combined". ;-) I suspect the former is more realistic as a starting point. I agree with @kxepal that it's important to be able to plug in custom strategies, and think it might make sense for Hypothesis to have a bunch of common provider strategies out of the box. |
|
I like to take a stab at this. I need it for work anyways, and why not share? To adhere to the KISS principle I think a first pass would be to generate "normal" email addresses. So just something matching the regular expression For something closer to the actual standard see Djangos validator. But for the vast majority I think this simplified version should be good enough™ Sounds good? Or is this already in and I completely missed it? |
|
It's semi already in - there's currently a purely internal version hidden inside the Django module which is good enough to use but I felt needed more work/thought to make public. If you wanted to do that work it would be very gratefully received! |
|
I assume you mean https://github.com/HypothesisWorks/hypothesis-python/blob/master/src/hypothesis/extra/django/models.py#L98-L122? I think it might be a bit bigger then that one, should it live directly under extra then? |
I agree. That's why it isn't ready to be made public without more work! The goal of that was to remove the dependency on faker's email address support (which was causing a major performance problem for Django users), so it only had to be not worse than that to be an improvement.
No, it's fine for it to go in hypothesis.strategies. extra is for packages that have external dependencies. |
|
So after a bit more work I've added a constant with all current top-level domain names. But it's kinda big, 1561, and I don't like it living in the class, but how/where do I put it instead? |
|
Most large strategies live in their own files in hypothesis/strategies. If you put it in its own file it's fine for there to be a large top level constant in the module. I thought we had an example of a non-python data file somewhere in the project, but I can't currently find it if so. |
|
You mean the unicode pickle thing? It's autogenerated from the unicode module, but this has to be checked in. I'll make it its on file then, and have the constant be in there. |
|
Yeah, I thought there was something else related to that, but I think in the end either I'm imagining things or it's something that went away. |
|
I may be completely confused, but would it not be better to use public suffix list instead of hard-coded list? |
|
Neat, I didn't know about that one. I'll definitely take a look. I could automatically download the list the first time, but that means internet access, or use the https://github.com/publicsuffix/list and package it for pypi. I think the latter is nicer, but I'll need some help with that. |
|
I don't understand why not having a hard-coded list is important enough to merit either making an HTTP request at run time, or adding a dependency to another package? The purpose of the public suffix list seems to be something else altogether, and it's not obvious why it should be used here. The need here is just for a large, representative set of sample domains. |
|
If I'm understanding it correctly, it doesn't just contain all top-level domains, but also information about which second-level ones are accepted. For instance It will still be a hardcoded list, just not maintained by us. And if we ship it together with the rest of the code, it won't add any external dependencies for the end user. What's neat about it is that as stuff changes, those will be picked up by that list, and in turn by hypothesis, instead of manually going in on https://www.iana.org/domains/root/db and run |
|
I'm pretty strongly against having Hypothesis make HTTP requests. It would definitely be good to automate the process of updating any lists from external data though. |
|
Okay so a status update. I've got most of code in place, but it's just a sketch for now. I tried to get away with just using I think I have to take the plunge and use |
Feel free to just use composite if that's easier for you - there's not very much difference between the two for your purposes. |
|
Unfortunately that's not quite the case, I need to either filter out a lot of examples, or build it up by hand since I have so many size limitations. For domains it's not just 1-127 subdomains with 1-63 characters, but also that the total length has to be below 253 characters long. Similarly for emails there are some funky rules regarding dots (and quoted text but I'll just ignore that for now). There are two size restrictions, the local part, aka before the @, must be at most 64 characters, and the entire address 254. However I think it's not to hard to generate this by just slowly building up the various parts until you hit the size limit(s) and then return. It might be possible with some clever use of Something like: def do_draw():
min_size = random.randrange(1, 127)
max_size random.randrange(min_size, 127)
top_domain = choose(LIST_OF_SUFFIXES)
labels = [] # subdomains in RFC lingo
total = len(top_domain) + 1
while True:
label = draw(text(min_size=1, max_size=max_size, alphabet=valid_characters))
total += len(label) + 1
if total > max_size:
break
labels.append(label)
labels.append(top_domain)
return ".".join(labels)def do_draw():
domain = draw(domain)
min_size = random.randint(1, 64)
max_size random.randint(min_size, 64)
parts = []
total = 0
while True:
part = draw(text(min_size=1, max_size=max_size, alphabet=valid_characters))
if part[0] == ".":
part[0] = part[1:]
if parts[-1][-1] != "." and random.random() < 0.5: # or whatever
part += "."
total += len(part)
if total > max_size or total + 1 + len(domain) > 254:
break
parts.append(part)
if parts[-1][-1] == ".":
parts[-1] = parts[-1][:-2]
assume(len(parts) > min_size)
return "%s@%s" % ("".join(parts), domain) |
|
Guys, FYI I created an extension to Hypothesis to generate data based on regular expression - hypothesis-regex. Generating email is a perfect use case for that library. Give it a try. I'm still working on it to address some rough edges and once done it will hopefully be merged in Hypothesis. |
|
Lovely, my giant project at work is finally over (pending review) so I can get back to this. I think a combination of my domain generator plus your regex would be perfect. Just gotta tinker with it 😉 |
|
@Zac-HD It's all up to how creative you can be with your email regex. The one that I have in hypothesis-regex's README is just simplified version for demo purposes. |
|
Oh, it's certainly possible! Probably easier to get the right minimising behaviour by writing a strategy than a regex though, and likely to be more efficient. 😜 |
|
Well, IMO hand written strategy will mimic the same structure as regex-based strategy, but regex provide a more compact syntax (DSL) to define them. |
|
Yeah, I figured as much but it doesn't hurt to try. That should also iron out the bugs and whatnot for the regex package which I consider a good thing ™️ |
|
I've added this to the 3.x milestone, because with the Faker integration going away many users will be looking for ways to generate email addresses. Literally just moving the provisional |
|
Would you mind a optional dependency on |
|
My strong preference would be to implement it with just the standard library - the provisional strategy already works, and there's a pretty clear upgrade path to improve each part of it. |
|
This issue is reserved for @sushobhit27 until the end of June 2018.
That means moving it from provisional.py to strategies.py, along with a changelog, documentation, a changelog (see guides/documentation.rst), adding yourself to the authors list in CONTRIBUTING.rst, and moving the tests. Please don't hesitate to ask for help if you get stuck! |
|
@Zac-HD shall I move just emails strategy or also move all strategies (like ip4_addr_strings, domains etc) from the file. |
|
Just emails - the others are not public API. |
|
@Zac-HD Thanks a ton for walking me through all the small steps for this PR, and a big thanks to all contributors of hypothesis, its really a great library which although I started using recently but found to be of immense help, while writing test cases. Thanks!! :) |
The django integration currently relies on the simple email strategy in
provisional.py. It would be nice to have a native email address type which was a bit better at hitting edge cases.When this issue was first opened, we relied on the
fake-factorypackage for Django email fields, which was substantially slower and less devious than Hypothesis strategies.That has been fixed, but we'd still like to improve the email strategy before making it part of the public API.
Anyone working on this issue should start with the provisional strategy, move it to
strategies.py, and gradually expand it to generate unusual things allowed by the relevant RFCs (see below). Very obscure features - such as >255 character addresses, or using an IP address instead of a domain name - should be avoided, or at least gated behind an optional argument (egemails(allow_obscure=False)).The text was updated successfully, but these errors were encountered: