Merge pull request #2066 from techdragon/better-domain-name-strategy
Created a better domain name strategy
Zac-HD committed Aug 20, 2019
2 parents 000387f + 60f69ee commit 70f3e4c
Showing 10 changed files with 1,705 additions and 37 deletions.
1 change: 1 addition & 0 deletions CONTRIBUTING.rst
@@ -260,6 +260,7 @@ their individual contributions.
* `Richard Boulton <https://www.github.com/rboulton>`_ (richard@tartarus.org)
* `Ryan Soklaski <https://www.github.com/rsokl>`_ (rsoklaski@gmail.com)
* `Ryan Turner <https://github.com/rdturnermtl>`_ (ryan.turner@uber.com)
* `Sam Bishop (TechDragon) <https://github.com/techdragon>`_ (sam@techdragon.io)
* `Sam Hames <https://www.github.com/SamHames>`_
* `Sanyam Khurana <https://github.com/CuriousLearner>`_
* `Saul Shanabrook <https://www.github.com/saulshanabrook>`_ (s.shanabrook@gmail.com)
11 changes: 11 additions & 0 deletions hypothesis-python/RELEASE.rst
@@ -0,0 +1,11 @@
RELEASE_TYPE: minor

This release improves the :func:`~hypothesis.provisional.domains`
strategy, as well as the :func:`~hypothesis.provisional.urls` and
the :func:`~hypothesis.strategies.emails` strategies which use it.
These strategies now use the full IANA list of Top Level Domains
and are correct as per :rfc:`1035`.

Passing tests using these strategies may now fail.

Thanks to `TechDragon <https://github.com/techdragon>`__ for this improvement.
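
As an illustration of the constraints this release note refers to, here is a minimal standard-library sketch of a checker for the :rfc:`1035` limits quoted elsewhere in this commit (255 characters per name, 63 per label, labels starting with a letter). The helper name `is_rfc1035_domain` and the constants are hypothetical and are not part of Hypothesis; the sketch only shows the kind of value a test may now receive.

```python
import re

# Label grammar as described in RFC 1035 section 2.3.1: a letter, then
# optional letters, digits or hyphens, ending in a letter or digit.
LABEL = re.compile(r"[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?")

# Limits quoted in this commit's DomainNameStrategy docstring.
MAX_NAME_LENGTH = 255
MAX_LABEL_LENGTH = 63


def is_rfc1035_domain(name):
    """Return True if `name` fits the length and syntax limits above.

    Hypothetical helper for illustration; not a Hypothesis API.
    """
    if not 0 < len(name) <= MAX_NAME_LENGTH:
        return False
    # An empty label (for example from "a..com") fails the regex match.
    return all(
        len(label) <= MAX_LABEL_LENGTH and LABEL.fullmatch(label) is not None
        for label in name.split(".")
    )
```

A test that previously only saw names ending in a handful of lowercase suffixes could plausibly start failing, since mixed-case names and any TLD from the IANA list are now possible.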
17 changes: 14 additions & 3 deletions hypothesis-python/docs/data.rst
@@ -13,13 +13,24 @@ and how to build them. Strategies have a variety of other important internal
features, such as how they simplify, but the data they can generate is the only
public part of their API.

~~~~~~~~~~~~~~~
Core Strategies
~~~~~~~~~~~~~~~

Functions for building strategies are all available in the hypothesis.strategies
module. The salient functions from it are as follows:

.. automodule:: hypothesis.strategies
:members:
:exclude-members: SearchStrategy

~~~~~~~~~~~~~~~~~~~~~~
Provisional Strategies
~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: hypothesis.provisional
:members:

.. _shrinking:

~~~~~~~~~
@@ -171,9 +182,9 @@ The problem is that you cannot call a strategy recursively and expect it to not
blow up and eat all your memory. The other problem here is that not all unicode strings
display consistently on different machines, so we'll restrict them in our doctest.

The way Hypothesis handles this is with the :py:func:`recursive` function
which you pass in a base case and a function that, given a strategy for your data type,
returns a new strategy for it. So for example:
The way Hypothesis handles this is with the :func:`~hypothesis.strategies.recursive`
strategy which you pass in a base case and a function that, given a strategy
for your data type, returns a new strategy for it. So for example:

.. code-block:: pycon
2 changes: 1 addition & 1 deletion hypothesis-python/setup.py
@@ -86,7 +86,7 @@ def local_file(name):
author_email="david@drmaciver.com",
packages=setuptools.find_packages(SOURCE),
package_dir={"": SOURCE},
package_data={"hypothesis": ["py.typed"]},
package_data={"hypothesis": ["py.typed", "vendor/tlds-alpha-by-domain.txt"]},
url="https://github.com/HypothesisWorks/hypothesis/tree/master/hypothesis-python",
project_urls={
"Website": "https://hypothesis.works",
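
The `package_data` change above ships the vendored IANA list with the package so that `provisional.py` can read it at import time. A minimal sketch of that parsing step, run against a made-up in-memory sample instead of the real file, might look as follows. The `SAMPLE` contents and the `load_tlds` helper are hypothetical; only the header check, the sort by length, and the leading `"COM"` mirror the code in this commit.

```python
import io

# Hypothetical excerpt laid out like tlds-alpha-by-domain.txt:
# one "#" comment header line, then one upper-case TLD per line.
SAMPLE = u"# Version 0, illustrative sample only\nAERO\nCOM\nUK\nORG\n"


def load_tlds(handle):
    """Parse a TLD list the way this commit's module-level code does."""
    header = next(handle)
    if not header.startswith("#"):
        raise ValueError("expected a '#' comment header line")
    # sorted() is stable, so names of equal length keep their file order.
    tlds = sorted((line.rstrip() for line in handle), key=len)
    # "COM" is placed first, as in the commit; it then appears twice.
    tlds.insert(0, "COM")
    return tlds


tlds = load_tlds(io.StringIO(SAMPLE))
```

Since `sampled_from` shrinks towards earlier elements, putting `"COM"` first and sorting the rest by length presumably makes minimal failing examples end in a short, familiar TLD; that motivation is an inference and is not stated in the commit.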
148 changes: 125 additions & 23 deletions hypothesis-python/src/hypothesis/provisional.py
@@ -20,62 +20,164 @@
It is intended for internal use, to ease code reuse, and is not stable.
Point releases may move or break the contents at any time!
Internet strategies should conform to https://tools.ietf.org/html/rfc3696 or
the authoritative definitions it links to. If not, report the bug!
Internet strategies should conform to :rfc:`3986` or the authoritative
definitions it links to. If not, report the bug!
"""
# https://tools.ietf.org/html/rfc3696

from __future__ import absolute_import, division, print_function

import os.path
import string

import hypothesis._strategies as st
import hypothesis.internal.conjecture.utils as cu
from hypothesis.errors import InvalidArgument
from hypothesis.searchstrategy.strategies import SearchStrategy

if False:
from typing import Text # noqa
from hypothesis.searchstrategy.strategies import SearchStrategy, Ex # noqa


URL_SAFE_CHARACTERS = frozenset(string.ascii_letters + string.digits + "$-_.+!*'(),")


# This file is sourced from http://data.iana.org/TLD/tlds-alpha-by-domain.txt
# The file contains additional information about the date that it was last updated.
with open(
os.path.join(os.path.dirname(__file__), "vendor", "tlds-alpha-by-domain.txt")
) as tld_file:
__header = next(tld_file)
assert __header.startswith("#")
TOP_LEVEL_DOMAINS = sorted((line.rstrip() for line in tld_file), key=len)
TOP_LEVEL_DOMAINS.insert(0, "COM")


class DomainNameStrategy(SearchStrategy):
@staticmethod
def clean_inputs(minimum, maximum, value, variable_name):
if value is None:
value = maximum
elif not isinstance(value, int):
raise InvalidArgument(
"Expected integer but %s is a %s"
% (variable_name, type(value).__name__)
)
elif not minimum <= value <= maximum:
raise InvalidArgument(
"Invalid value %r < %s=%r < %r"
% (minimum, variable_name, value, maximum)
)
return value

def __init__(self, max_length=None, max_element_length=None):
"""
A strategy for :rfc:`1035` fully qualified domain names.
The upper limit for max_length is 255 in accordance with :rfc:`1035#section-2.3.4`
The lower limit for max_length is 4, corresponding to a two letter domain
with a single letter subdomain.
The upper limit for max_element_length is 63 in accordance with :rfc:`1035#section-2.3.4`
The lower limit for max_element_length is 1 in accordance with :rfc:`1035#section-2.3.4`
"""
# https://tools.ietf.org/html/rfc1035#section-2.3.4

max_length = self.clean_inputs(4, 255, max_length, "max_length")
max_element_length = self.clean_inputs(
1, 63, max_element_length, "max_element_length"
)

super(DomainNameStrategy, self).__init__()
self.max_length = max_length
self.max_element_length = max_element_length

# These regular expressions are constructed to match the documented
# information in https://tools.ietf.org/html/rfc1035#section-2.3.1
# which defines the allowed syntax of a subdomain string.
if self.max_element_length == 1:
self.label_regex = r"[a-zA-Z]"
elif self.max_element_length == 2:
self.label_regex = r"[a-zA-Z][a-zA-Z0-9]?"
else:
maximum_center_character_pattern_repetitions = self.max_element_length - 2
self.label_regex = r"[a-zA-Z]([a-zA-Z0-9\-]{0,%d}[a-zA-Z0-9])?" % (
maximum_center_character_pattern_repetitions,
)

def do_draw(self, data):
# 1 - Select a valid top-level domain (TLD) name
# 2 - Check that the number of characters in our selected TLD won't
# prevent us from generating at least a 1 character subdomain.
# 3 - Randomize the TLD between upper and lower case characters.
domain = data.draw(
st.sampled_from(TOP_LEVEL_DOMAINS)
.filter(lambda tld: len(tld) + 2 <= self.max_length)
.flatmap(
lambda tld: st.tuples(
*[st.sampled_from([c.lower(), c.upper()]) for c in tld]
).map(u"".join)
)
)
# The maximum possible number of subdomains is 126,
# 1 character subdomain + 1 '.' character, * 126 = 252,
# with a max of 255, that leaves 3 characters for a TLD.
# Allowing any more subdomains would not leave enough
# characters for even the shortest possible TLDs.
elements = cu.many(data, min_size=1, average_size=1, max_size=126)
while elements.more():
# Generate a new valid subdomain using the regex strategy.
sub_domain = data.draw(st.from_regex(self.label_regex, fullmatch=True))
if len(domain) + len(sub_domain) >= self.max_length:
data.stop_example(discard=True)
break
domain = sub_domain + "." + domain
return domain


@st.defines_strategy_with_reusable_values
def domains(
max_length=255, # type: int
max_element_length=63, # type: int
):
# type: (...) -> SearchStrategy[Text]
"""Generate :rfc:`1035` compliant fully qualified domain names."""
return DomainNameStrategy(
max_length=max_length, max_element_length=max_element_length
)


@st.defines_strategy_with_reusable_values
def urls():
# type: () -> SearchStrategy[Text]
"""A strategy for :rfc:`3986`, generating http/https URLs."""

def url_encode(s):
safe_chars = set(string.ascii_letters + string.digits + "$-_.+!*'(),")
return "".join(c if c in safe_chars else "%%%02X" % ord(c) for c in s)
return "".join(c if c in URL_SAFE_CHARACTERS else "%%%02X" % ord(c) for c in s)

schemes = st.sampled_from(["http", "https"])
ports = st.integers(min_value=0, max_value=2 ** 16 - 1).map(":{}".format)
paths = st.lists(st.text(string.printable).map(url_encode)).map(
lambda path: "/".join([""] + path)
)
paths = st.lists(st.text(string.printable).map(url_encode)).map("/".join)

return st.builds(
"{}://{}{}{}".format, schemes, domains(), st.one_of(st.just(""), ports), paths
u"{}://{}{}/{}".format, schemes, domains(), st.just(u"") | ports, paths
)


@st.defines_strategy_with_reusable_values
def domains():
"""A strategy for :rfc:`1035` fully qualified domain names."""
atoms = st.text(
string.ascii_letters + "0123456789-", min_size=1, max_size=63
).filter(lambda s: "-" not in s[0] + s[-1])
return st.builds(
lambda x, y: ".".join(x + [y]),
st.lists(atoms, min_size=1),
# TODO: be more devious about top-level domains
st.sampled_from(["com", "net", "org", "biz", "info"]),
).filter(lambda url: len(url) <= 255)


@st.defines_strategy_with_reusable_values
def ip4_addr_strings():
# type: () -> SearchStrategy[Text]
"""A strategy for IPv4 address strings.
This consists of four strings representing integers [0..255],
without zero-padding, joined by dots.
"""
return st.builds("{}.{}.{}.{}".format, *(4 * [st.integers(0, 255)]))
return st.builds(u"{}.{}.{}.{}".format, *(4 * [st.integers(0, 255)]))


@st.defines_strategy_with_reusable_values
def ip6_addr_strings():
# type: () -> SearchStrategy[Text]
"""A strategy for IPv6 address strings.
This consists of sixteen quads of hex digits (0000 .. FFFF), joined
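
The label pattern and the `max_size=126` bound in `DomainNameStrategy` above can be checked in isolation. The following is a minimal sketch using only the standard library; `build_label_regex` and `label_ok` are hypothetical helper names that restate the three regex cases from the diff, and the final lines redo the arithmetic from the comment in `do_draw`.

```python
import re


def build_label_regex(max_element_length):
    """Restate the three cases used by DomainNameStrategy in this diff."""
    if max_element_length == 1:
        return r"[a-zA-Z]"
    if max_element_length == 2:
        return r"[a-zA-Z][a-zA-Z0-9]?"
    # One leading letter, up to (n - 2) middle characters, and one final
    # letter or digit give at most n characters in total.
    return r"[a-zA-Z]([a-zA-Z0-9\-]{0,%d}[a-zA-Z0-9])?" % (max_element_length - 2)


def label_ok(label, max_element_length=63):
    """Return True if `label` fully matches the pattern for the given limit."""
    return re.fullmatch(build_label_regex(max_element_length), label) is not None


# A name built from k one-character labels, each followed by ".", plus a TLD
# has length 2 * k + len(tld). With the 255 limit and a two-letter TLD:
MAX_LENGTH = 255
SHORTEST_TLD = 2
max_subdomains = (MAX_LENGTH - SHORTEST_TLD) // 2
```

Under these assumptions `max_subdomains` comes out at 126, matching the bound passed to `cu.many` in the diff.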
