# Resolve Domains in DNSDB

Having clustered the registrable domains, we'd like to explore their subdomains.

First step: we need to collect them all from DNSDB. After that, we will want to cluster the subdomains, but that can be done in a separate notebook. Here we'll just deal with querying the domains.  

## How to resolve everything in DNSDB

Borrowing from Joe St. Sauver's examples on bulk-querying DNSDB, I should be able to do this fairly quickly. The trick is going to be finding the right time fence. I think what I need to do is find the earliest create-date from the whois entries of the registrable domains, and go from there until today. The create-date should be in the domain information we dumped from Clay's investigation in Iris, so let's grab that. 

In [1]:
import json
import datetime

In [2]:
with open("disneyland_domains.json") as infile:
    data = json.load(infile)

In [3]:
data.keys()

dict_keys(['response'])

In [4]:
domains = data["response"]["results"]

In [5]:
min_date = datetime.date.today()
for entry in domains:
    create_string = entry['create_date']['value']
    create_date = datetime.datetime.strptime(create_string, "%Y-%m-%d").date()
    if create_date < min_date:
        min_date = create_date

In [6]:
min_date

datetime.date(2019, 5, 16)

In [7]:
# frustratingly, you have to jump through some hoops to get a UTC-timezone'd timestamp
start_time = datetime.datetime.combine(min_date, datetime.datetime.min.time(), tzinfo=datetime.timezone.utc).timestamp()
start_time

1557964800.0

Okay, that's my minimum starting time for these disneyland domains. No point in looking for anything earlier, since none of them were created before then. I could check with Clay to see if there is an end point, but for now, let's assume today.

In [8]:
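# utcnow() returns a naive datetime, which .timestamp() would interpret as local time
end_time = datetime.datetime.now(datetime.timezone.utc).timestamp()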
end_time

1679091726.496638
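
The two ends of the time fence can be wrapped in small helpers so the timezone handling lives in one place. This is a minimal sketch; `date_to_utc_epoch` and `utc_now_epoch` are hypothetical names introduced here, not part of the notebook or of any library.

```python
import datetime

def date_to_utc_epoch(day):
    """Return the Unix timestamp for midnight UTC at the start of `day`."""
    midnight = datetime.datetime.combine(
        day, datetime.time.min, tzinfo=datetime.timezone.utc
    )
    return int(midnight.timestamp())

def utc_now_epoch():
    """Return the current Unix timestamp from a timezone-aware datetime."""
    return int(datetime.datetime.now(datetime.timezone.utc).timestamp())

fence_start = date_to_utc_epoch(datetime.date(2019, 5, 16))  # 1557964800
fence_end = utc_now_epoch()
```

Building both values from timezone-aware datetimes keeps the fence independent of the local timezone of whatever machine runs the notebook.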

## Actually Querying DNSDB

In [9]:
import sys  # needed by make_query for sys.stderr
import time
from time import strftime, gmtime
from collections import defaultdict
import requests
import idna

In [10]:
def get_api_key():
    """ Retrieve the DNSDB API key """
    api_key_file_path = "/Users/ageeclough/.dnsdb-apikey.fordemo"
    try:
        with open(api_key_file_path, encoding='UTF-8') as my_api_file:
            val = my_api_file.readline().strip()
    except FileNotFoundError:
        exit("DNSDB API key should be in ~/.dnsdb-apikey.fordemo")
    return val

In [11]:
def remove_saf_entries(mylist):
    """ Remove the Streaming API Framing Records From the Results """
    mylist2 = mylist.splitlines()
    # Strip the streaming format bookend records; the emptiness checks
    # avoid an IndexError when the response body has no lines
    if mylist2 and mylist2[0] == '{"cond":"begin"}':
        mylist2.pop(0)
    if mylist2 and mylist2[-1] in ('{"cond":"succeeded"}',
                                   '{"cond":"limited","msg":"Result limit reached"}'):
        mylist2.pop()
    return mylist2
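
The string comparisons above depend on the exact text of the framing records. As an alternative sketch, each line can be parsed as JSON and sorted by key. This assumes that framing records carry a `cond` key and data records carry an `obj` key, as in the records handled in this notebook. `extract_saf_objects` is a hypothetical helper, not part of any DNSDB client library, and unlike `remove_saf_entries` it returns parsed dictionaries rather than raw strings.

```python
import json

def extract_saf_objects(body):
    """Return (records, final_cond) parsed from a streaming response body."""
    records = []
    final_cond = None
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        message = json.loads(line)
        if "obj" in message:
            records.append(message["obj"])  # a data record
        if "cond" in message:
            final_cond = message["cond"]    # a framing record; keep the last one seen
    return records, final_cond

# Made-up sample body for illustration only
sample = "\n".join([
    '{"cond":"begin"}',
    '{"obj":{"rrname":"example.com.","rrtype":"A","rdata":["192.0.2.1"]}}',
    '{"cond":"limited","msg":"Result limit reached"}',
])
records, final_cond = extract_saf_objects(sample)
```

Returning the final `cond` value lets the caller notice when a result set was cut off by the limit instead of silently treating it as complete.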

In [12]:
def print_detailed_bits(myrecord):
    """format results for display"""
    myrecord_json_format = json.loads(myrecord)
    my_rrtype = myrecord_json_format['obj']['rrtype']
    # assume we're only interested in "A" records and CNAMEs
    if my_rrtype.rstrip() not in ("A", "CNAME"):
        return None
    my_rrname = myrecord_json_format['obj']['rrname']
    my_count = myrecord_json_format['obj']['count']
    my_rdata = myrecord_json_format['obj']['rdata']
    myformat = '%Y-%m-%d %H:%M'
    # time_last
    try:
        extract_tl = myrecord_json_format['obj']['time_last']
    except KeyError:
        extract_tl = myrecord_json_format['obj']['zone_time_last']
    enddatetime = strftime(myformat, gmtime(extract_tl))

    # time_first
    try:
        extract_tf = myrecord_json_format['obj']['time_first']
    except KeyError:
        extract_tf = myrecord_json_format['obj']['zone_time_first']
    startdatetime = strftime(myformat, gmtime(extract_tf))

    result = {
        "domain": my_rrname,
        "type": my_rrtype,
        "enddate": enddatetime,
        "startdate": startdatetime,
        "count": my_count,
        "rdata": my_rdata
    }
    return result

In [13]:
def make_query(myfqdn, my_apikey, start_time, end_time):
    """ Make the DNSDB query for myfqdn; return a list of formatted records """
    # keep rrsets whose first/last-seen window overlaps the time fence
    url = "https://api-iad.dnsdb.info/dnsdb/v2/lookup/rrset/name/" + myfqdn + \
        f"?limit=1000000&time_last_after={int(start_time)}&time_first_before={int(end_time)}"
    myheaders = {'X-API-Key': my_apikey, 'Accept': 'application/jsonl'}
    r = requests.get(url, headers=myheaders, timeout=3600)
    results = []
    # Status Code 200 == Success
    if r.status_code == 200:
        stripped_results = remove_saf_entries(r.text)
        for myfqdns in stripped_results:
            myline = print_detailed_bits(myfqdns)
            if myline is not None:
                results.append(myline)
    else:
        sys.stderr.write(myfqdn + " returned status code=" + str(r.status_code) + "\n")
    # always return a list (empty on error) so callers can iterate over it safely
    return results

In [14]:
api_key = get_api_key()

In [15]:
make_query("python.com.", api_key, start_time, end_time)

[{'domain': 'python.com.',
  'type': 'A',
  'enddate': '2023-03-17 09:18',
  'startdate': '2020-07-28 13:18',
  'count': 19014,
  'rdata': ['3.96.23.237']},
 {'domain': 'python.com.',
  'type': 'A',
  'enddate': '2019-10-28 22:03',
  'startdate': '2016-01-20 15:04',
  'count': 29289,
  'rdata': ['199.59.88.81']},
 {'domain': 'python.com.',
  'type': 'A',
  'enddate': '2020-07-28 02:07',
  'startdate': '2020-04-16 15:16',
  'count': 1693,
  'rdata': ['199.73.55.35']}]

In [16]:
make_query("*.python.com.", api_key, start_time, end_time)

[{'domain': 'python.com.',
  'type': 'A',
  'enddate': '2023-03-17 09:18',
  'startdate': '2020-07-28 13:18',
  'count': 19014,
  'rdata': ['3.96.23.237']},
 {'domain': 'python.com.',
  'type': 'A',
  'enddate': '2019-10-28 22:03',
  'startdate': '2016-01-20 15:04',
  'count': 29289,
  'rdata': ['199.59.88.81']},
 {'domain': 'python.com.',
  'type': 'A',
  'enddate': '2020-07-28 02:07',
  'startdate': '2020-04-16 15:16',
  'count': 1693,
  'rdata': ['199.73.55.35']},
 {'domain': 'www.python.com.',
  'type': 'A',
  'enddate': '2023-03-17 09:49',
  'startdate': '2020-07-28 19:09',
  'count': 3869,
  'rdata': ['3.96.23.237']},
 {'domain': 'www.python.com.',
  'type': 'A',
  'enddate': '2020-07-27 23:36',
  'startdate': '2020-04-16 08:52',
  'count': 599,
  'rdata': ['199.73.55.35']},
 {'domain': 'www.python.com.',
  'type': 'CNAME',
  'enddate': '2020-04-15 17:51',
  'startdate': '2010-06-24 17:58',
  'count': 49246,
  'rdata': ['python.com.']},
 {'domain': 'help.python.com.',
  'type': '

Nice, so that works well. One thing to note: we need to make separate queries for the apex of the domain and the subdomains of the apex. As we can see above, you get a different set of results when querying the two. For purposes of this experiment, we're mostly only interested in the subdomains, but for completeness' sake, we'll query both.

Now, to run that against every registrable domain in the set:

In [18]:
resolutions = defaultdict(dict)
for entry in domains:
    domain = entry['domain']
    if not domain.endswith("."):
        domain = domain + "."
    encoded = idna.encode(domain).decode("utf-8")
    apex = make_query(encoded, api_key, start_time, end_time)
    resolutions[domain]['apex'] = apex
    subdomains = make_query("*." + encoded, api_key, start_time, end_time)
    resolutions[domain]['subdomains'] = subdomains

This will take a while because it's single-threaded. Have a look at Joe St. Sauver's bulk-querying examples if you want to run multiple queries at once and make things like this go faster.
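
For reference, here is a minimal sketch of running the queries on a thread pool using only the standard library. It is not Joe St. Sauver's approach, just one common pattern. `query_all` and `fake_query` are hypothetical names; `fake_query` stands in for `make_query` so the sketch runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor

def query_all(names, query_fn, max_workers=4):
    """Run query_fn(name) for every name on a thread pool; return {name: result}."""
    names = list(names)  # materialize so the names can be iterated twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each name with its result
        return dict(zip(names, pool.map(query_fn, names)))

def fake_query(name):
    """Stand-in for make_query; returns one made-up record per name."""
    return [{"domain": name, "type": "A"}]

demo = query_all(["example.com.", "*.example.com."], fake_query)
```

In this notebook, `query_fn` could be something like `functools.partial(make_query, my_apikey=api_key, start_time=start_time, end_time=end_time)`. The right `max_workers` depends on your API quota and any concurrency limits on your key, so treat the default of 4 as a placeholder.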

Let's look at some of the results:

In [19]:
resolutions["ushaạnk.com."]

{'apex': [{'domain': 'xn--ushank-zc8b.com.',
   'type': 'A',
   'enddate': '2023-02-27 08:59',
   'startdate': '2023-02-10 03:12',
   'count': 18,
   'rdata': ['127.0.0.1']},
  {'domain': 'xn--ushank-zc8b.com.',
   'type': 'A',
   'enddate': '2023-01-23 13:33',
   'startdate': '2023-01-22 14:30',
   'count': 2,
   'rdata': ['209.99.40.222']}],
 'subdomains': [{'domain': 'xn--ushank-zc8b.com.',
   'type': 'A',
   'enddate': '2023-02-27 08:59',
   'startdate': '2023-02-10 03:12',
   'count': 18,
   'rdata': ['127.0.0.1']},
  {'domain': 'xn--ushank-zc8b.com.',
   'type': 'A',
   'enddate': '2023-01-23 13:33',
   'startdate': '2023-01-22 14:30',
   'count': 2,
   'rdata': ['209.99.40.222']},
  {'domain': 'www.xn--ushank-zc8b.com.',
   'type': 'A',
   'enddate': '2023-02-12 07:13',
   'startdate': '2023-02-12 07:13',
   'count': 2,
   'rdata': ['127.0.0.1']},
  {'domain': 'singlepoint.xn--ushank-zc8b.com.',
   'type': 'A',
   'enddate': '2022-10-29 04:07',
   'startdate': '2022-01-20 16:45'

Just so we don't have to do that again (and so we can use it elsewhere), let's write all this to disk.

In [20]:
with open("dnsdb_resolutions.json", "w") as outfile:
    json.dump(resolutions, fp=outfile)

Next, let's just look at the names discovered, rather than the IPs and CNAME info. That can be interesting later, but for now let's just concentrate on the names themselves.

In [20]:
children = dict()
for domain in resolutions:
    if resolutions[domain]["apex"]:
        base = resolutions[domain]["apex"][0]["domain"]
    elif resolutions[domain]['subdomains']:
        base = resolutions[domain]['subdomains'][0]['domain']
    else:
        continue
    subs = {entry["domain"] for entry in resolutions[domain]['subdomains'] if entry["domain"] != base}
    children[domain] = subs

In [24]:
children["cprapid.com."]

{'jfaetszlb-p66z6e10978r.51-81-32-91.cprapid.com.',
 '178-75-7-0.cprapid.com.',
 '13-126-211-91.cprapid.com.',
 '103-14-99-79.cprapid.com.',
 'mail.70-38-51-148.cprapid.com.',
 'www.13-250-26-93.cprapid.com.',
 '143-244-131-9.www.87-121-52-9.cprapid.com.',
 'readme.demo.demo.demo.demo.readme.readme.demo.readme.demo.readme.demo.demo.demo.demo.demo.mail.88-85-89-83.cprapid.com.',
 'mail.5-161-73-201.cprapid.com.',
 'readme.demo.readme.readme.readme.demo.readme.readme.demo.readme.readme.readme.readme.demo.whm.88-85-89-83.cprapid.com.',
 'conto-online-info.162-0-237-145.cprapid.com.',
 'root.35-88-77-251.cprapid.com.',
 'www.182-54-236-10.cprapid.com.',
 'sberbank.avito.avito.avito.blablacar.sber.sberbank.sber.2aid11imtmuy6q2.3-89-181-9.cprapid.com.',
 'readme.demo.readme.demo.readme.readme.readme.readme.readme.demo.readme.readme.readme.readme.readme.readme.demo.cpcontacts.88-85-89-83.cprapid.com.',
 'h8rkbs2jg4bzfg.161-35-142-24.cprapid.com.',
 'readme.demo.readme.readme.demo.readme.demo.

Okay, one of those, cprapid.com, is *huge*. I suspect it's a wildcard zone, i.e., it's configured to answer with a valid response for *any* query, no matter how weird. That will be a huge pollutant in the analysis, so I need to decide how to handle it. For now, I think I'm just going to exclude any domain with more than 20 subdomains from the subdomain analysis.
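
A minimal sketch of that exclusion, assuming a mapping shaped like `children` (domain name to a set of subdomain names). `split_by_subdomain_count` and the toy data are hypothetical and only illustrate the filter.

```python
MAX_SUBDOMAINS = 20  # threshold from the text above; a judgment call, not a tuned value

def split_by_subdomain_count(children, limit=MAX_SUBDOMAINS):
    """Split {domain: set_of_subdomains} into (kept, excluded) by subdomain count."""
    kept = {d: subs for d, subs in children.items() if len(subs) <= limit}
    # for excluded domains, record only the count so the outliers are easy to review
    excluded = {d: len(subs) for d, subs in children.items() if len(subs) > limit}
    return kept, excluded

# Made-up data: one ordinary domain and one wildcard-like domain
toy_children = {
    "small.example.": {"www.small.example.", "mail.small.example."},
    "wild.example.": {f"host{i}.wild.example." for i in range(50)},
}
kept, excluded = split_by_subdomain_count(toy_children)
```

Keeping the excluded domains and their counts, rather than dropping them silently, makes it easy to revisit the threshold later.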

It's also worth noting that some of the domains have no subdomains at all. That's possibly an interesting point to give back to the Threat Intel folks as well. 
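
To list those domains, one option is to compare the names returned by the wildcard query against the names returned by the apex query. This is a sketch assuming the `resolutions` layout used above (`apex` and `subdomains` lists of records with a `domain` key). `domains_without_subdomains` is a hypothetical helper, and it also reports domains that had no records at all inside the time fence.

```python
def domains_without_subdomains(resolutions):
    """Return sorted domains whose wildcard results contain no name beyond the apex."""
    empty = []
    for domain, result in resolutions.items():
        # `or []` tolerates a missing or None result for either query
        apex_names = {rec["domain"] for rec in (result.get("apex") or [])}
        sub_names = {rec["domain"] for rec in (result.get("subdomains") or [])}
        if not (sub_names - apex_names):
            empty.append(domain)
    return sorted(empty)

# Made-up data for illustration only
toy_resolutions = {
    "quiet.example.": {
        "apex": [{"domain": "quiet.example."}],
        "subdomains": [{"domain": "quiet.example."}],
    },
    "busy.example.": {
        "apex": [{"domain": "busy.example."}],
        "subdomains": [{"domain": "busy.example."}, {"domain": "www.busy.example."}],
    },
    "silent.example.": {"apex": [], "subdomains": None},
}
no_subs = domains_without_subdomains(toy_resolutions)
```

Comparing record names with each other, rather than with the dictionary key, sidesteps the mismatch between Unicode keys and the punycode names that DNSDB returns.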

Now with that done, let's move on to clustering on these subdomains.