# Overview

Analyzing `result_stats_20221129.json`

In [1]:
import json

# Each line of the file is one JSON object describing a single URL attempt.
with open('result_stats.json', 'r') as f:
    results = [json.loads(line) for line in f if line.strip()]

# Count URL-level successes; everything else is a failure.
total_num = len(results)
success_num = sum(1 for result in results if result['is_success'])
failure_num = total_num - success_num

print("Total url: %d" % total_num)
print("Success: %d Ratio: %f" % (success_num, success_num/total_num))
print("Failure: %d Ratio: %f" % (failure_num, failure_num/total_num))

Total url: 13382
Success: 5781 Ratio: 0.431998
Failure: 7601 Ratio: 0.568002


TimeMachine generates two URLs for each domain (http://domain and https://domain) for screening, so a domain counts as successfully processed as long as at least one of its URLs succeeds.

In [2]:
# Sort by domain so that the two results of each domain are adjacent.
# This pairing assumes every domain has exactly two entries.
sorted_results = sorted(results, key=lambda result: result.get("domain"))
total_domain_num = len(sorted_results) // 2
success_domain_num = 0
failure_domain_num = 0
for i in range(total_domain_num):
    http_result = sorted_results[2*i]
    https_result = sorted_results[2*i+1]
    # A domain succeeds if either of its two URLs succeeded.
    if http_result['is_success'] or https_result['is_success']:
        success_domain_num += 1
    else:
        failure_domain_num += 1
        
print("Total domain: %d" % total_domain_num)
print("Success: %d Ratio: %f" % (success_domain_num, success_domain_num/total_domain_num))
print("Failure: %d Ratio: %f" % (failure_domain_num, failure_domain_num/total_domain_num))

Total domain: 6691
Success: 3677 Ratio: 0.549544
Failure: 3014 Ratio: 0.450456
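
The pairing in the cell above relies on every domain having exactly two entries once the results are sorted. A minimal sketch of a grouping-based alternative that does not depend on that assumption is shown here. The helper name `count_domain_outcomes` and the sample records are hypothetical; only the `domain` and `is_success` keys come from the data above.

```python
from collections import defaultdict

def count_domain_outcomes(results):
    """Return (success_domains, failure_domains) for a list of result dicts."""
    # Map each domain to True if at least one of its URLs succeeded.
    domain_ok = defaultdict(bool)
    for result in results:
        domain_ok[result["domain"]] |= bool(result["is_success"])
    success = sum(domain_ok.values())
    return success, len(domain_ok) - success

# Hypothetical sample records, not taken from the real file.
sample = [
    {"domain": "a.example", "is_success": False},
    {"domain": "a.example", "is_success": True},
    {"domain": "b.example", "is_success": False},
    {"domain": "b.example", "is_success": False},
    {"domain": "c.example", "is_success": True},  # only one URL recorded
]
print(count_domain_outcomes(sample))
```

With this approach a domain that is missing one of its two URLs is still counted once, instead of shifting the pairing of every domain after it.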


The following are error types and counts.

In [3]:
# Count failures by the first line of their error message.
err_msgs = {}
for result in results:
    if not result['is_success']:
        err_msg = result['err_message'].split('\n')[0]
        err_msgs[err_msg] = err_msgs.get(err_msg, 0) + 1

# Print the error types from most to least frequent.
err_msgs = sorted(err_msgs.items(), key=lambda x: x[1], reverse=True)
for err_msg in err_msgs:
    print("%s\nTotal: %d\tRatio: %f" % (err_msg[0], err_msg[1], err_msg[1]/failure_num))

Got exception <class 'playwright._impl._api_types.TimeoutError'>: Timeout 10000ms exceeded.
Total: 5679	Ratio: 0.747139
Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_CONNECTION_REFUSED
Total: 772	Ratio: 0.101566
Got exception <class 'playwright._impl._api_types.Error'>: SSL_ERROR_BAD_CERT_DOMAIN
Total: 369	Ratio: 0.048546
Got exception <class 'playwright._impl._api_types.Error'>: SSL_ERROR_UNKNOWN
Total: 207	Ratio: 0.027233
Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_NET_RESET
Total: 196	Ratio: 0.025786
Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_NET_INTERRUPT
Total: 155	Ratio: 0.020392
Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_UNKNOWN_HOST
Total: 78	Ratio: 0.010262
Got exception <class 'playwright._impl._api_types.Error'>: SEC_ERROR_UNKNOWN_ISSUER
Total: 74	Ratio: 0.009736
Got exception <class 'playwright._impl._api_types.Error'>: SEC_ERROR_EXPIRED_CERTIFICATE
Total: 32	Ratio: 0.004210
Got 

# TimeoutError

Retrying 100 of the URLs listed below gave 31 successes and 69 failures, which suggests that some of these timeouts are random (transient) rather than permanent.

After **lengthening the default timeout**, there were 49 successes and 51 failures.

After **using a proxy**, this error message disappears, with 66 successes and 34 failures.
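
The two adjustments above (a longer default timeout and a proxy) might be applied roughly as in the following sketch. It assumes Playwright's synchronous Python API and Firefox as the browser, since the `NS_ERROR_*` codes are Firefox error codes. The helper names, the 60000 ms timeout, and the proxy address in the usage note are hypothetical; the values actually used in the retries are not recorded here.

```python
def build_launch_options(proxy_server=None):
    """Build keyword arguments for Playwright's browser launch call."""
    options = {"headless": True}
    if proxy_server:
        # Playwright accepts a proxy dict with a "server" key at launch.
        options["proxy"] = {"server": proxy_server}
    return options

def retry_url(url, timeout_ms=60000, proxy_server=None):
    """Try to load one URL and return (is_success, err_message)."""
    # Imported lazily so the option helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.firefox.launch(**build_launch_options(proxy_server))
        try:
            page = browser.new_page()
            # Hypothetical value, longer than the 10000 ms in the error above.
            page.set_default_timeout(timeout_ms)
            page.goto(url)
            return True, ""
        except Exception as exc:
            return False, str(exc)
        finally:
            browser.close()
```

A call such as `retry_url("http://example.com", proxy_server="http://127.0.0.1:8080")` would route the request through a local proxy; the address is only a placeholder.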

In [None]:
timeouterror_urls = []
for result in results:
    if "Got exception <class 'playwright._impl._api_types.TimeoutError'>: Timeout 10000ms exceeded." in result['err_message']:
        timeouterror_urls.append(result['result_basedir'].replace('_', '://'))
for url in timeouterror_urls:
    print(url)
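
The same filter is repeated for every error type in this notebook, and each retry uses 100 of the matching URLs. A minimal sketch of a shared helper plus one possible way to pick the 100 URLs follows. The names `urls_with_error` and `sample_for_retry` are hypothetical, random sampling is an assumption (how the 100 URLs were chosen is not stated), and `result_basedir` is assumed to look like `http_example.com`, so only the first underscore is replaced.

```python
import random

def urls_with_error(results, needle):
    """Return URLs of results whose error message contains `needle`."""
    return [
        # Assumption: only the first underscore separates scheme and host.
        result["result_basedir"].replace("_", "://", 1)
        for result in results
        if needle in (result.get("err_message") or "")
    ]

def sample_for_retry(urls, k=100, seed=0):
    """Pick up to `k` URLs reproducibly for a retry run."""
    rng = random.Random(seed)
    return rng.sample(urls, min(k, len(urls)))

# Hypothetical sample records, not taken from the real file.
sample = [
    {"result_basedir": "http_a.example", "err_message": "Error: NS_ERROR_NET_RESET\ndetails"},
    {"result_basedir": "https_b_c.example", "err_message": "Error: NS_ERROR_NET_RESET"},
    {"result_basedir": "https_d.example", "err_message": ""},
]
matching = urls_with_error(sample, "NS_ERROR_NET_RESET")
print(sample_for_retry(matching))
```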

# NS_ERROR_CONNECTION_REFUSED

Retrying 100 of the URLs listed below:

After **using a proxy**, this error message disappears, with 22 successes and 78 failures.

However, new error messages appear: `Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_NET_INTERRUPT` appears 70 times and `Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_NET_RESET` appears 6 times.


In [None]:
connectionrefused_urls = []
for result in results:
    if "Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_CONNECTION_REFUSED" in result['err_message']:
        connectionrefused_urls.append(result['result_basedir'].replace('_', '://'))
for url in connectionrefused_urls:
    print(url)

# SSL_ERROR_BAD_CERT_DOMAIN

Retrying 100 of the URLs listed below:

After **editing playwright's context options** to set `ignoreHTTPSErrors = True`, this error message disappears, with 85 successes and 15 failures.
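
The context option mentioned above could be set as in the following sketch, which assumes Playwright's synchronous Python API, where the option is spelled `ignore_https_errors`. The helper names are hypothetical, and Firefox is assumed as the browser.

```python
def new_lenient_context(browser):
    """Create a browser context that does not abort on certificate errors."""
    # `ignore_https_errors` is the Python spelling of `ignoreHTTPSErrors`.
    return browser.new_context(ignore_https_errors=True)

def load_ignoring_cert_errors(url):
    """Try to load one URL with certificate checks relaxed."""
    # Imported lazily so the helper above can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.firefox.launch()
        try:
            page = new_lenient_context(browser).new_page()
            page.goto(url)
            return True, ""
        except Exception as exc:
            return False, str(exc)
        finally:
            browser.close()
```

Relaxing certificate checks only hides the certificate problem on the client side, so the pages loaded this way are still served with an invalid certificate.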


In [None]:
badcertdomain_urls = []
for result in results: 
    if "Got exception <class 'playwright._impl._api_types.Error'>: SSL_ERROR_BAD_CERT_DOMAIN" in result['err_message']:
        badcertdomain_urls.append(result['result_basedir'].replace('_', '://'))
for url in badcertdomain_urls:
    print(url)

# SSL_ERROR_UNKNOWN

Retrying 100 of the URLs listed below:

After **editing playwright's context options** to set `ignoreHTTPSErrors = True`, there were 55 successes and 45 failures.

When using chromium as playwright's browser, the error messages are more explicit, including `ERR_SSL_PROTOCOL_ERROR`, `ERR_SSL_VERSION_OR_CIPHER_MISMATCH`, and `ERR_SSL_UNRECOGNIZED_NAME_ALERT`.

I checked some of the failed URLs manually with [Website Planet](https://www.websiteplanet.com/webtools/down-or-not/), and all of them were DOWN across the globe. My guess is that something is wrong with these websites' SSL certificates, which causes the error.


In [None]:
sslunknown_urls = []
for result in results: 
    if "Got exception <class 'playwright._impl._api_types.Error'>: SSL_ERROR_UNKNOWN" in result['err_message']:
        sslunknown_urls.append(result['result_basedir'].replace('_', '://'))
for url in sslunknown_urls:
    print(url)

# NS_ERROR_NET_RESET

After retrying 100 of the URLs listed below, this error message disappears, with 73 successes and 27 failures.

In [None]:
netreset_urls = []
for result in results:
    if "Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_NET_RESET" in result['err_message']:
        netreset_urls.append(result['result_basedir'].replace('_', '://'))
for url in netreset_urls:
    print(url)

# NS_ERROR_NET_INTERRUPT

Retrying 100 of the URLs listed below gave 4 successes and 96 failures.

I checked 20 of the URLs manually with [Website Planet](https://www.websiteplanet.com/webtools/down-or-not/), and all of them were DOWN across the globe. My guess is that these error messages are caused by the websites themselves.

In [None]:
netinterrupt_urls = []
for result in results:
    if "Got exception <class 'playwright._impl._api_types.Error'>: NS_ERROR_NET_INTERRUPT" in result['err_message']:
        netinterrupt_urls.append(result['result_basedir'].replace('_', '://'))
for url in netinterrupt_urls:
    print(url)