# Analysis of sites' WHOIS registration dates

The notebook below collects and analyzes the registration dates of the website domains for this project's list of sites.

The first section standardizes the site URLs to generate a list of unique domains, excluding domains belonging to blogging platforms.

The second section aggregates the WHOIS data for those domains by year, month, and political affiliation.

In [1]:
import pandas as pd
import re

## Creating the list of domains

### Loading the sites and converting URLs to domains

In [2]:
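def site_to_domain(site):
    # Missing values pass through as None
    if pd.isnull(site):
        return None
    # Strip the scheme, if present, keeping only the host
    match = re.match(r"https?://([^/]+)", site)
    if match:
        site = match.group(1)
    # Drop any path, then keep the last two dot-separated labels
    site = re.sub(r"/.*$", "", site)
    return ".".join(site.split(".")[-2:]).lower()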
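
As a quick sanity check, the reduction to the last two labels can be exercised on a few inputs. The sketch below restates the logic in a standalone helper (`last_two_labels` is a hypothetical name, and the `example.*` inputs are made up for illustration). It also shows a known limitation of this approach: multi-part suffixes such as `.co.uk` collapse to the suffix itself.

```python
import re

def last_two_labels(site):
    """Hypothetical standalone restatement of the last-two-labels reduction."""
    match = re.match(r"https?://([^/]+)", site)
    if match:
        site = match.group(1)
    site = re.sub(r"/.*$", "", site)
    return ".".join(site.split(".")[-2:]).lower()

# Illustrative inputs only; just the blogspot site appears in the project data
examples = {
    site: last_two_labels(site)
    for site in [
        "http://www.example.com/some/path",
        "newzeal.blogspot.com",
        "news.example.co.uk",
    ]
}
print(examples)
```

The last input reduces to `co.uk`, so a list containing such sites would need a public-suffix-aware approach instead.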

In [3]:
all_sites = pd.read_csv("../data/all-partisan-sites.csv")
all_sites["site"] = all_sites["site"].str.strip().str.lower()
all_sites["domain"] = all_sites["site"].apply(site_to_domain)
all_domains = all_sites["domain"].dropna().unique()
len(all_domains)

665

### Ignoring three domains from the list that belong to blogging platforms

In [4]:
DOMAINS_TO_IGNORE = [
    "blogspot.com",
    "blogspot.ca",
    "wordpress.com",
]

In [5]:
all_sites[
    all_sites["domain"].isin(DOMAINS_TO_IGNORE)
][[ "site", "domain", "political_category" ]]

,site,domain,political_category
4,4threvolutionarywar.wordpress.com,wordpress.com,left
144,counterinformation.wordpress.com,wordpress.com,left
362,newzeal.blogspot.com,blogspot.com,right
492,sharpelbowsstl.blogspot.ca,blogspot.ca,right
534,theimmoralminority.blogspot.ca,blogspot.ca,left


### Generating a list of unique domains

In [6]:
unique_domains = list(sorted(set(all_domains) - set(DOMAINS_TO_IGNORE)))
unique_domains[:5]

['100percentfedup.com',
 '21stcenturywire.com',
 '24dailynew.com',
 '24usnews.com',
 '63red.com']

In [7]:
with open("../output/unique-domains.txt", "w") as f:
    f.write("\n".join(unique_domains))

## Fetching WHOIS data for the domains

The domains above were then submitted to DomainTools' [Bulk Parsed Whois service](https://research.domaintools.com/bulk-parsed-whois/), which produced the file at [`data/domaintools-whois-results.csv`](../data/domaintools-whois-results.csv).

## Analyzing the domain registration dates

The main columns of interest in the WHOIS data are the domain and the WHOIS registration date:

In [8]:
whois = pd.read_csv("../data/domaintools-whois-results.csv", parse_dates=[ "create date" ])\
    .rename(columns={
        "create date": "date_registered"
    })
whois.head()[[ "domain", "date_registered" ]]

,domain,date_registered
0,100percentfedup.com,2012-03-13
1,21stcenturywire.com,2009-11-03
2,24dailynew.com,2017-02-21
3,24usnews.com,2016-07-03
4,63red.com,2011-12-05


### Merging the WHOIS data with the manually collected political affiliations

In [9]:
registration_dates = pd.merge(
    all_sites[[ "site", "domain", "political_category" ]],
    whois[[ "domain", "date_registered" ]],
    on="domain",
    how="left"
).drop_duplicates(subset=["site"])\
    .sort_values([ "date_registered", "site" ])

registration_dates["year_registered"] = registration_dates["date_registered"].dt.strftime("%Y")
registration_dates["month_registered"] = registration_dates["date_registered"].dt.strftime("%Y-%m")
registration_dates["date_registered"] = registration_dates["date_registered"].dt.strftime("%Y-%m-%d")

registration_dates = registration_dates.replace("NaT", float("nan"))

registration_dates.head()

,site,domain,political_category,date_registered,year_registered,month_registered
4,4threvolutionarywar.wordpress.com,wordpress.com,left,,,
144,counterinformation.wordpress.com,wordpress.com,left,,,
362,newzeal.blogspot.com,blogspot.com,right,,,
492,sharpelbowsstl.blogspot.ca,blogspot.ca,right,,,
534,theimmoralminority.blogspot.ca,blogspot.ca,left,,,
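
The derivation of the year and month columns can be illustrated on toy data. This is a minimal sketch, not the notebook's exact code: instead of replacing `"NaT"` strings afterwards, it masks missing dates with `Series.where`, which should behave the same regardless of how a given pandas version formats `NaT`. The two dates are illustrative.

```python
import pandas as pd

# Toy series with one valid date and one missing value
dates = pd.to_datetime(pd.Series(["2012-03-13", None]))

# Format to strings, then blank out rows where the date is missing
has_date = dates.notnull()
year = dates.dt.strftime("%Y").where(has_date)
month = dates.dt.strftime("%Y-%m").where(has_date)

print(pd.DataFrame({"year_registered": year, "month_registered": month}))
```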


In [10]:
registration_dates.to_csv("../output/whois-registration-dates.csv", index=False)

### Checking for sites for which we don't have WHOIS registration dates

Good sign: the only sites missing a registration date are those whose domains we explicitly excluded above.

In [11]:
registration_dates[
    registration_dates["date_registered"].isnull()
]

,site,domain,political_category,date_registered,year_registered,month_registered
4,4threvolutionarywar.wordpress.com,wordpress.com,left,,,
144,counterinformation.wordpress.com,wordpress.com,left,,,
362,newzeal.blogspot.com,blogspot.com,right,,,
492,sharpelbowsstl.blogspot.ca,blogspot.ca,right,,,
534,theimmoralminority.blogspot.ca,blogspot.ca,left,,,
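
The visual check above could also be expressed as an assertion. The following is a sketch on toy data: the `toy` frame and its rows are assumptions standing in for `registration_dates`, and only the `DOMAINS_TO_IGNORE` list is taken from the notebook.

```python
import pandas as pd

DOMAINS_TO_IGNORE = ["blogspot.com", "blogspot.ca", "wordpress.com"]

# Toy stand-in for registration_dates (rows are illustrative only)
toy = pd.DataFrame({
    "site": ["a.wordpress.com", "example.com", "b.blogspot.ca"],
    "domain": ["wordpress.com", "example.com", "blogspot.ca"],
    "date_registered": [None, "2012-03-13", None],
})

# Sites with no registration date
missing = toy[toy["date_registered"].isnull()]

# Any missing-date site whose domain was NOT deliberately excluded
unexpected = missing[~missing["domain"].isin(DOMAINS_TO_IGNORE)]
assert unexpected.empty, "Some non-excluded domains lack WHOIS dates"
```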


### Aggregating registration dates by year, month, and political affiliation

In [12]:
annual_registrations = registration_dates\
    .dropna(subset=["date_registered"])\
    .groupby([ "year_registered", "political_category" ])\
    .size()\
    .unstack().fillna(0).astype(int)\
    .assign(total=lambda x: x.sum(axis=1))

annual_registrations.to_csv("../output/whois-registration-counts-annual.csv")
annual_registrations.head()

year_registered,left,right,total
1994,2,1,3
1995,7,10,17
1996,2,5,7
1997,3,5,8
1998,5,10,15
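
To make the aggregation steps concrete, here is the same `groupby` / `size` / `unstack` chain applied to a small toy frame (the rows are made up for illustration). Note how `fillna(0)` fills the year with no left-leaning registrations.

```python
import pandas as pd

# Toy data: one row per site, already reduced to year and category
toy = pd.DataFrame({
    "year_registered": ["2015", "2015", "2016", "2016", "2016"],
    "political_category": ["left", "right", "right", "right", "right"],
})

counts = (
    toy.groupby(["year_registered", "political_category"])
    .size()                  # rows per (year, category) pair
    .unstack()               # categories become columns
    .fillna(0)               # absent pairs count as zero
    .astype(int)
    .assign(total=lambda x: x.sum(axis=1))
)
print(counts)
```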


In [13]:
monthly_registrations = registration_dates\
    .dropna(subset=["date_registered"])\
    .groupby([ "month_registered", "political_category" ])\
    .size()\
    .unstack().fillna(0).astype(int)\
    .assign(total=lambda x: x.sum(axis=1))
    
monthly_registrations.to_csv("../output/whois-registration-counts-monthly.csv")
monthly_registrations.head()

month_registered,left,right,total
1994-11,0,1,1
1994-12,2,0,2
1995-02,1,0,1
1995-03,0,1,1
1995-04,1,1,2

