**This notebook is a journal of the tests run in the Teplitsa project CSRLab. The scripts in this notebook were created to clean the initial dataset and to collect the following data:**

* **Robots.txt, sitemap check**
* **Mobile friendliness test**
* **Social networks on the index page**
* **Looking up donation/help/bank account details on the index page**
* **Looking up the SSL-certificate of the website**
* **WCAG audit with `wcag-zoo`**

In [None]:
import requests, json, os 
from requests.exceptions import TooManyRedirects, ConnectionError, InvalidURL, ContentDecodingError, ReadTimeout, ChunkedEncodingError
headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}

from bs4 import BeautifulSoup

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def strip_url(url):
  return url.replace('%20', '')

In [None]:
from google.colab import auth
auth.authenticate_user()

from google.colab import drive, files
drive.mount('/content/drive')

import pandas as pd

file = '/content/drive/My Drive/gryadka/gryadka_v1.csv'

Mounted at /content/drive


# Correcting websites' urls

`gryadka_v1.csv` is the first version of the database of NGO websites, with NGO names, unique identifiers (`ogrn`), organisation forms, etc. It was collected from the [OpenNGO](https://openngo.ru/) database, the ["Если быть точным" project](https://tochno.st/nko?check_params=is_verify#), the [SPARK Interfax database](https://spark-interfax.ru), and [Teplitsa data](https://te-st.ru/).

Since `gryadka_v1.csv` contains a lot of company and personal data (see below), it is not public but is available on request.

In [None]:
df = pd.read_csv(file)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22569 entries, 0 to 22568
Data columns (total 39 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ogrn                22569 non-null  int64  
 1   minjustRegNum       22266 non-null  object 
 2   regionName          22569 non-null  object 
 3   logo                0 non-null      float64
 4   statusDetail        22551 non-null  object 
 5   fullName            22569 non-null  object 
 6   dateReg             6191 non-null   object 
 7   minjustForm         21670 non-null  object 
 8   charter             0 non-null      float64
 9   minjustStatus       22311 non-null  object 
 10  opf                 22569 non-null  object 
 11  oktmo               22551 non-null  object 
 12  egrulStatus         22551 non-null  object 
 13  mainOkved           22566 non-null  object 
 14  regionCode          22551 non-null  float64
 15  incomeTotal         22551 non-null  float64
 16  emai



In [None]:
list(zip(df.ogrn.tolist(), df.website.tolist()))[:40]

[(1207700237084, 'burningheart-charity.ru'),
 (1203400008460, 'nkosocium.ru'),
 (1203400005390, 'rck-vlg.ru, rtsk-vlg.ru'),
 (1207700444764, 'intellect-foundation.ru, фонд-интеллект.рф'),
 (1207700136852, 'sabf.ru'),
 (1207800101530, 'asp-rpo.ru, sro-rpo.ru'),
 (1206100024019, 'msprnd.ru'),
 (1207700234268, 'ano-consensus.ru'),
 (1207800033737, ' https://cardiomama-ano.ru/'),
 (1202600014990, 'specprofi26.ru'),
 (1202300041624, 'dorogisochi.ru'),
 (1207700228560, 'https://xn--90aiaudcsdq4i.xn--d1acj3b/'),
 (1207700220221, 'fondserdcarodiny.com'),
 (1203400007183, 'poelizdorovo.ru'),
 (1204700005971, 'рбоонадежда47регион.рф'),
 (1207400039703, 'fpscho.ru'),
 (1207700162713, 'securitymedia.ru'),
 (1207800044385, 'anornim.ru, rnim.ru'),
 (1200200004520, '0312gov.ru, atpprf.ru, nkoapo.ru'),
 (1201600046482, 'ecology-tatarstan.ru'),
 (1207700479106, 'sistema-university.ru'),
 (1206700011891, 'rosekzamen.ru'),
 (1206100022810, 'http://rak-pobedim.com'),
 (1205000052245, 'dvbfond.ru'),
 (1205

Websites are stored in inconsistent forms: with or without a scheme, with stray spaces, and sometimes with several comma-separated websites per record.

There are also pages hosted on third-party websites. Examples:
- https://dobro.ru/organizations/1565552/info
- http://admtmo.ru/sfery-deyatelnosti/malyy-biznes/obshchestvennye-organizatsii/topkinskoe-gorodskoe-otdelenie-vserossiyskoy-obshchestvennoy-organizatsii-veteranov-pensionerov-voyn/
- http://sonko.samregion.ru/node/49

These are excluded from the analysis.

In [None]:
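def check_http(website):
  website = website.strip() # some urls have spaces around them
  if website.startswith('http'):
    return website
  else:
    return 'http://%s' % website # requests lib will be used later; if the website uses 'https', it will redirect

def check_multiple(website):
  # a record may hold several comma-separated urls; always return a list
  if ',' in website:
    urls = website.split(',')
    return [check_http(u) for u in urls]
  else:
    return [check_http(website)]

def third_party_page(website):
  # NB: despite the name, returns True when the url is NOT a third-party page,
  # i.e. when nothing follows the domain, and False when a '/' follows the domain
  # (this also rejects bare domains written with a trailing slash)
  if len(website.split('//')) > 1:
    if len(website.split('//')[1].split('/')) > 1: # has subpages
      return False
    else:
      return True
  else:
    return True # no scheme in the record, so it is kept as is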
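
The same cleaning steps can also be expressed with `urllib.parse`, which separates the host from the path instead of splitting on `/`. The sketch below is only an illustration, not the code used to build the dataset; the names `normalize` and `has_subpage` are hypothetical, and treating a lone trailing slash as "no subpage" is an assumption.

```python
from urllib.parse import urlsplit

def normalize(raw):
    """Split a raw cell into URLs, adding 'http://' where the scheme is missing."""
    urls = []
    for part in raw.split(','):
        part = part.strip()
        if not part:
            continue
        if not part.startswith(('http://', 'https://')):
            part = 'http://' + part
        urls.append(part)
    return urls

def has_subpage(url):
    """True if the URL points below the site root (a likely third-party page)."""
    return urlsplit(url).path not in ('', '/')
```

With this variant, `' https://cardiomama-ano.ru/'` is kept as a root URL, while `https://dobro.ru/organizations/1565552/info` is flagged as a subpage.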

Collecting the corrected website addresses

In [None]:
websites = []
for ogrn, website in list(zip(df.ogrn.tolist(), df.website.tolist())):
  if third_party_page(website):
    urls = check_multiple(website)
    for u in urls:
      websites.append({'ogrn' : ogrn, 'website' : u})

Total number of correct website addresses:

In [None]:
len(websites) 

22298

In [None]:
websites[:10]

[{'ogrn': 1207700237084, 'website': 'http://burningheart-charity.ru'},
 {'ogrn': 1203400008460, 'website': 'http://nkosocium.ru'},
 {'ogrn': 1203400005390, 'website': 'http://rck-vlg.ru'},
 {'ogrn': 1203400005390, 'website': 'http://rtsk-vlg.ru'},
 {'ogrn': 1207700444764, 'website': 'http://intellect-foundation.ru'},
 {'ogrn': 1207700444764, 'website': 'http://фонд-интеллект.рф'},
 {'ogrn': 1207700136852, 'website': 'http://sabf.ru'},
 {'ogrn': 1207800101530, 'website': 'http://asp-rpo.ru'},
 {'ogrn': 1207800101530, 'website': 'http://sro-rpo.ru'},
 {'ogrn': 1206100024019, 'website': 'http://msprnd.ru'}]

Checking whether each website still exists (takes a long time):

**addition 1:**

`UnicodeError` raised with `http://moo-spzh.ru` - it redirects to `союзправославныхженщин.рф`. Not obvious why exception is raised. Added to exception log for such errors.
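
To inspect internationalised hosts like this one, the hostname can be converted to its ASCII (punycode) form with Python's built-in `idna` codec. This is a minimal sketch for debugging only; it is not part of the original pipeline and is not a confirmed fix for the `UnicodeError` above. The helper name `ascii_host` is hypothetical.

```python
from urllib.parse import urlsplit

def ascii_host(url):
    """Return the hostname of `url` in its ASCII (punycode) form."""
    host = urlsplit(url).hostname
    # the stdlib 'idna' codec encodes each dot-separated label separately
    return host.encode('idna').decode('ascii')
```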

**addition 2:**

`ContentDecodingError` raised while testing - added to exceptions

`{'ogrn': 1191513000568, 'website': 'http://komsomolosetii.ru'}`

`{'ogrn': 1191513000568, 'website': 'http://nasledie-osetii.ru'}`

`{'ogrn': 1037739764094, 'website': 'http://www.rfcda.ru'}`

`{'ogrn': 1025700787210, 'website': 'http://hospisorel.ru'}`

`{'ogrn': 1155200002011, 'website': 'http://vektor-kstovo.ru'}`

`{'ogrn': 1167700060990, 'website': 'http://www.kursydipacademy.ru'}`

**addition 3:**

`http://www.vog.su` took very long to process. The `timeout = 10` parameter was added so that a request is abandoned after 10 seconds and the loop doesn't get stuck; the resulting `ReadTimeout` exception is caught.
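
The handling described in these additions boils down to three outcomes per URL: alive, silently skipped, or logged for later inspection. The sketch below shows that pattern with built-in exception classes and an injected `fetch` callable so it can be tried without network access; `check_all` and `fetch` are hypothetical names, and with `requests` the corresponding classes from `requests.exceptions` would be caught instead.

```python
def check_all(urls, fetch):
    """Return (alive, skipped, logged); `fetch(url)` must return a status code."""
    alive, skipped, logged = [], [], []
    for url in urls:
        try:
            if fetch(url) == 200:
                alive.append(url)
        except (ConnectionError, TimeoutError):
            skipped.append(url)  # unreachable or too slow: dropped
        except UnicodeError:
            logged.append(url)   # kept for manual inspection
    return alive, skipped, logged
```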

In [None]:
websites_checked, unicode_error_log = [], []

for i, u in enumerate(websites):
  try:
    if requests.get(u['website'], verify = False, timeout = 10).status_code == 200:
      u['i'] = i  # for log purposes
      websites_checked.append(u)
  except (ConnectionError, InvalidURL, TooManyRedirects, ReadTimeout):
    pass
  except ContentDecodingError:
    print(u)
  except UnicodeError:
    unicode_error_log.append(u)

{'ogrn': 1025700787210, 'website': 'http://hospisorel.ru'}
{'ogrn': 1155200002011, 'website': 'http://vektor-kstovo.ru'}
{'ogrn': 1167700060990, 'website': 'http://www.kursydipacademy.ru'}


Saving results

In [None]:
pd.DataFrame(websites_checked)[['ogrn', 'website']].to_csv('websites_checked.csv', index = False)
!cp websites_checked.csv "/content/drive/My Drive/gryadka/"

Reading from file

In [None]:
websites_checked_df = pd.read_csv("/content/drive/My Drive/gryadka/websites_checked.csv")
websites_checked_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16494 entries, 0 to 16493
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ogrn     16494 non-null  int64 
 1   website  16494 non-null  object
dtypes: int64(1), object(1)
memory usage: 257.8+ KB


Latest version

In [None]:
websites_checked_df = pd.read_csv("/content/drive/My Drive/gryadka/2021_lab_websites_checked_v3.csv")
websites_checked_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15386 entries, 0 to 15385
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ogrn     15386 non-null  int64 
 1   website  15386 non-null  object
dtypes: int64(1), object(1)
memory usage: 240.5+ KB


Check what share of the initial set is left

In [None]:
len(websites_checked_df) / len(websites)

0.7397075970939098

Select 10 random websites from list for tests

In [None]:
from random import randrange
websites_random = []

for i in range(10):
  websites_random.append(websites_checked_df.website.tolist()[randrange(len(websites_checked_df))])
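
Note that `randrange` can return the same index more than once, so the list above may contain duplicates. A possible alternative, shown here only as a sketch, is `random.sample`, which draws without replacement; the helper name `pick_random` and the optional `seed` parameter are assumptions added for reproducibility.

```python
import random

def pick_random(items, k=10, seed=None):
    """Draw k distinct items; pass a seed to make the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(items), k)
```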

In [None]:
websites_random

['http://avto-viraj.ru',
 'http://казанский-собор.рф',
 'http://ruszhuravka.ru',
 'http://trinity-averkievo.ru',
 'http://radchenko-ballet.com',
 'http://redcross-mosuvao.ru',
 'http://oppo-rnt.ru',
 'http://redcross.tomsk.ru',
 'http://hrhi.ru',
 'http://fondbezgranits.ru']

# Robots.txt, sitemap check

FAQ on [sitemaps.org](https://www.sitemaps.org/faq.html#faq_compression)

For more details on the structure of `robots.txt` files, see http://www.robotstxt.org/orig.html.

`reppy` is used for parsing `robots.txt` files: https://github.com/seomoz/reppy
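
For reference, the `Sitemap:` entries of a `robots.txt` file can also be read with the standard library alone. The sketch below uses `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later) on text that has already been downloaded; it is an alternative illustration, not the `reppy`-based code used in this notebook, and `sitemaps_from_robots` is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present
    return parser.site_maps() or []
```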

In [None]:
! pip install reppy

from reppy.robots import Robots
from reppy.exceptions import BadStatusCode, ExcessiveRedirects, ConnectionException

Collecting reppy
  Downloading reppy-0.4.14.tar.gz (93 kB)
Building wheels for collected packages: reppy
  Building wheel for reppy (setup.py) ... done
  Created wheel for reppy: filename=reppy-0.4.14-cp37-cp37m-linux_x86_64.whl size=794710 sha256=561f315edd2b0ae865b459422053f8e58027cc69c

In [None]:
sitemap_files = ['sitemap.xml', 'sitemap.xml.gz', 'sitemap', 'xmlsitemap', 
                 'sitemap_index.xml', 'sitemap_index.xml.gz', '.sitemap.xml', 'sitemap-index.xml',
                 'sitemap-index.xml.gz', 'sitemap/sitemap-index.xml'] # possible names of the sitemap page

def check_status_code(url):
  if requests.get(url, verify = False, headers = headers).status_code > 200:
    return False
  else:
    return True

def sitemap_in_robots(website):
  robots = Robots.fetch('%s/robots.txt' % website, verify = False, headers = headers)
  return robots.sitemaps

def collect_bots(website):
  robots_txt = check_status_code('%s/robots.txt' % website)
  if robots_txt: # checking sitemap path in robots.txt
    if len(sitemap_in_robots(website)) > 0:
      sitemap = True
      is_sitemap_in_robots = True
      return {'url': website, 'robots_txt': robots_txt, 'sitemap_page': sitemap, 'is_sitemap_in_robots' : is_sitemap_in_robots}
    else:
      sitemap = False
      is_sitemap_in_robots = False
  else:
    robots_txt = False
    sitemap = False
    is_sitemap_in_robots = False
  # trying to find sitemap file not in robots.txt
  for i in sitemap_files:
    try:
      if check_status_code('%s/%s' % (website, i)):
        sitemap = True
    except TooManyRedirects:
      pass
  return {'url': website, 'robots_txt': robots_txt, 'sitemap_page': sitemap, 'is_sitemap_in_robots' : is_sitemap_in_robots}
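
The fallback search over candidate file names can be separated from the HTTP call, which makes it easy to try offline and lets the search stop at the first hit. This is a sketch under assumptions: `find_sitemap` and the injected `exists` callable are hypothetical, and the candidate list is shortened for illustration.

```python
SITEMAP_NAMES = ['sitemap.xml', 'sitemap.xml.gz', 'sitemap_index.xml']  # shortened list

def find_sitemap(website, exists):
    """Return the first candidate sitemap URL for which `exists(url)` is true, else None."""
    for name in SITEMAP_NAMES:
        url = '%s/%s' % (website.rstrip('/'), name)
        if exists(url):
            return url  # stop at the first hit instead of probing every name
    return None
```

In the notebook's terms, `exists` would be a wrapper around `check_status_code` that also swallows `TooManyRedirects`.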

In [None]:
collect_bots(websites_random[5]['website'])

{'is_sitemap_in_robots': False,
 'robots_txt': True,
 'sitemap_page': True,
 'url': 'http://rejdu.ru'}

In [None]:
os.listdir("/content/drive/My Drive/gryadka/")

['gryadka_v1.csv',
 'websites_checked.csv',
 'mobile_friendly_1.json',
 'mobile_friendly_log.json',
 'social_1.csv',
 'social_log_1.json',
 'social_1.gsheet',
 '2021_lab_websites_checked_v2.csv',
 '2021_lab_websites_checked_error_log.csv',
 '2021_lab_websites_checked_v3.csv',
 'robots',
 '2021_lab_sitemap_robots_txt_check.csv']

In [None]:
frames_df = []
for f in ['collect_bots.csv', 'collect_bots_3.csv', 'collect_bots_2.csv', 'collect_bots_3_2.csv',
          'collect_bots_3_3.csv', 'collect_bots_3_4.csv']:
    frames_df.append(pd.read_csv("/content/drive/My Drive/gryadka/"+f))

In [None]:
bots = pd.concat(frames_df)
bots.columns = ['website', 'robots_txt', 'sitemap_page', 'is_sitemap_in_robots']
bots.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20650 entries, 0 to 2317
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   website               20650 non-null  object
 1   robots_txt            20650 non-null  bool  
 2   sitemap_page          20650 non-null  bool  
 3   is_sitemap_in_robots  20650 non-null  bool  
dtypes: bool(3), object(1)
memory usage: 383.2+ KB


In [None]:
bots.merge(websites_checked_df, on = 'website').drop_duplicates(subset = ['website']).to_csv("/content/drive/My Drive/gryadka/" + '2021_lab_sitemap_robots_txt_check.csv', index = False)

In [None]:
bots.merge(websites_checked_df, on = 'website').drop_duplicates(subset = ['website']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14953 entries, 0 to 18956
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   website               14953 non-null  object
 1   robots_txt            14953 non-null  bool  
 2   sitemap_page          14953 non-null  bool  
 3   is_sitemap_in_robots  14953 non-null  bool  
 4   ogrn                  14953 non-null  int64 
dtypes: bool(3), int64(1), object(1)
memory usage: 394.3+ KB


In [None]:
soc = pd.concat([pd.read_csv("/content/drive/My Drive/gryadka/" + 'social_1.csv'), pd.read_csv("/content/drive/My Drive/gryadka/" + 'social_2.csv')]).drop_duplicates(subset= 'url')
soc.columns = ['website', 'fb', 'vk', 'ig', 'ok', 'youtube', 'tiktok']
soc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 2409
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   website  6000 non-null   object
 1   fb       1649 non-null   object
 2   vk       2108 non-null   object
 3   ig       1667 non-null   object
 4   ok       499 non-null    object
 5   youtube  756 non-null    object
 6   tiktok   30 non-null     object
dtypes: object(7)
memory usage: 375.0+ KB


In [None]:
soc.merge(websites_checked_df, on = 'website').drop_duplicates(subset = ['website']).to_csv("/content/drive/My Drive/gryadka/" + '2021_lab_social_networks_check.csv', index = False)

In [None]:
def select(js_line):
  return {k: v for k, v in js_line.items() if k in ['mobileFriendliness', 'website']}

frames = []
with open("/content/drive/My Drive/gryadka/"+'mobile_friendly_1.json') as f:
    data = json.load(f)
for l in data:
  frames.append(select(l))

with open("/content/drive/My Drive/gryadka/"+'mobile_friendly.json') as f:
    data = json.load(f)
for l in data:
  frames.append(select(l))

In [None]:
pd.DataFrame(frames).drop_duplicates(subset = ['website']).merge(websites_checked_df, on = 'website').to_csv(
    "/content/drive/My Drive/gryadka/" + '2021_lab_mobile_friendliness_check.csv', index = False)

In [None]:
don = pd.read_csv("/content/drive/My Drive/gryadka/"+'donations.csv')
don.columns = ['website', 'account', 'donation', 'help_page']
don.drop_duplicates(subset=['website']).merge(websites_checked_df, on = 'website').to_csv(
    "/content/drive/My Drive/gryadka/" + '2021_lab_donations_check.csv', index = False)

**Collecting data**

**add 1**: `ConnectionError: ('Connection aborted.'` for `http://bashterra.ru`. Exception added to log

**add 2**: `BadStatusCode: ('Got 502 for http://prtk.ru/robots.txt', 502)`. Exception added to log, import from `reppy`

**add 3**: `ExcessiveRedirects: Exceeded 30 redirects.` Exception added to log, import from `reppy`

**add 4**: `ConnectionException.` Exception added to log, import from `reppy`

In [None]:
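#frames, log = [], [] # left commented out, presumably so that re-running the cell does not reset collected results

# the slice start is a hand-summed offset used to resume after interrupted runs
for u in websites_checked_df.website.tolist()[10000 +759+74+ 3200 + 99+13 +499  :]:
    # checkpoint: once more than 50 results are collected, results and log are saved on every iteration
    if len(frames) > 50:
        pd.DataFrame(frames).to_csv('collect_bots_3_4.csv', index = False)
        !cp collect_bots_3_4.csv "/content/drive/My Drive/gryadka/"
        with open("/content/drive/My Drive/gryadka/robots_log_3_4.json", 'w') as f:
            json.dump({'log': log}, f)
    try:
        frames.append(collect_bots(u))
    except (ConnectionError, UnicodeError, TooManyRedirects, BadStatusCode, ExcessiveRedirects, ConnectionException, ReadTimeout, ChunkedEncodingError):
        log.append(u) # failed websites are kept in the log and skipped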
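
A long-running collection like this can be made restartable without hand-computed offsets by recording finished URLs in a checkpoint file and skipping them on the next run. The following is a minimal sketch of that idea, not the procedure used for this dataset; `run_with_checkpoints`, `worker`, and the JSON checkpoint layout are assumptions, and `worker` is expected to return a dict with a `'url'` key, as `collect_bots` does.

```python
import json
import os

def run_with_checkpoints(urls, worker, path, every=50):
    """Apply `worker` to each url, saving results to `path` every `every` items."""
    results = []
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)  # resume from the previous run
    done = {r['url'] for r in results}
    for url in urls:
        if url in done:
            continue  # already processed in an earlier run
        results.append(worker(url))
        if len(results) % every == 0:
            with open(path, 'w') as f:
                json.dump(results, f)
    with open(path, 'w') as f:
        json.dump(results, f)  # final save
    return results
```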

In [None]:
pd.DataFrame(frames).to_csv('collect_bots_4.csv', index = False)
!cp collect_bots_4.csv "/content/drive/My Drive/gryadka/"

In [None]:
pd.read_csv('/content/drive/My Drive/gryadka/collect_bots_3_3.csv')

Unnamed: 0,url,robots_txt,sitemap_page,is_sitemap_in_robots
0,http://academy-communication.ru,True,True,True
1,http://fond-dt.ru,True,True,False
2,http://socinnovations.ru,True,True,True
3,http://grand-e.ru,True,False,False
4,http://sports-trio.ru,True,True,True
...,...,...,...,...
594,http://musicpremia.ru,True,False,False
595,http://bfr-mdhsh.ru,True,True,True
596,http://бфр-мдхш.рф,True,True,True
597,http://восхождение.рус,True,True,True


# Mobile friendly test

Google Search Console APIs > URL Testing Tools API (Beta) > [Runs Mobile-Friendly Test for a given URL](https://developers.google.com/webmaster-tools/search-console-api/reference/rest/v1/urlTestingTools.mobileFriendlyTest/run)

**Requires a Google Developer API key.**

In [None]:
url = 'https://searchconsole.googleapis.com/v1/urlTestingTools/mobileFriendlyTest:run'
api_key = API_KEY # the key deleted for security purposes

In [None]:
def check_mobile_friendly(website):
  params = {'url': website,
          'requestScreenshot': 'false',
          'key': api_key}
  x = requests.post(url, data = params)
  data = json.loads(x.text)
  data['website'] = website
  return data

In [None]:
check_mobile_friendly(websites_random[8]['website'])

{'mobileFriendliness': 'MOBILE_FRIENDLY',
 'resourceIssues': [{'blockedResource': {'url': 'https://connect.ok.ru/connect.js'}},
  {'blockedResource': {'url': 'https://mc.yandex.ru/metrika/advert.gif?t=ti(4)'}},
  {'blockedResource': {'url': 'https://mc.yandex.ru/watch/57548695?callback=_ymjsp794643707&page-url=https%3A%2F%2Fpomozhem-detyam.ru%2F&charset=utf-8&browser-info=pv%3A1%3Agdpr%3A14%3Avf%3A25rt5q1nhcb5k4y7at%3Afu%3A0%3Aen%3Autf-8%3Ala%3Aen-US%3Av%3A675%3Acn%3A1%3Adp%3A0%3Als%3A1339183188023%3Ahid%3A444351026%3Az%3A-420%3Ai%3A202101022071615%3Aet%3A1634912175%3Ac%3A1%3Arn%3A122965040%3Arqn%3A1%3Au%3A16349121751044682726%3Aw%3A412x732%3As%3A412x732x24%3Ask%3A2.625%3Antf%3A1%3Ans%3A1634912175000%3Ads%3A0%2C0%2C2%2C0%2C9%2C0%2C%2C40%2C0%2C%2C%2C%2C40%3Adsn%3A0%2C0%2C%2C0%2C10%2C0%2C%2C28%2C0%2C%2C%2C%2C40%3Awv%3A2%3Arqnl%3A1%3Ast%3A1634912175%3At%3A%D0%91%D0%BB%D0%B0%D0%B3%D0%BE%D1%82%D0%B2%D0%BE%D1%80%D0%B8%D1%82%D0%B5%D0%BB%D1%8C%D0%BD%D1%8B%D0%B9%20%D1%84%D0%BE%D0%BD%D0%B4%20%27

In [None]:
check_mobile_friendly(websites_random[2]['website'])

{'mobileFriendliness': 'MOBILE_FRIENDLY',
 'testStatus': {'status': 'COMPLETE'},
 'website': 'http://art-nevagrad.ru'}
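
Since the raw responses vary in which keys they contain, it can help to flatten them before saving. The sketch below relies only on the keys visible in the two outputs above (`mobileFriendliness`, `testStatus`, `resourceIssues`, plus the added `website`); the function name `summarize_mobile_result` and the output layout are hypothetical.

```python
def summarize_mobile_result(data):
    """Flatten one API response into a small record, tolerating missing keys."""
    return {
        'website': data.get('website'),
        'status': data.get('testStatus', {}).get('status'),
        'mobileFriendliness': data.get('mobileFriendliness'),
        'blocked_resources': [
            issue['blockedResource']['url']
            for issue in data.get('resourceIssues', [])
            if 'blockedResource' in issue
        ],
    }
```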

# Social networks on the index page

Looking up links to social networks on the home page.

In [None]:
def get_social_networks(website):
  page = BeautifulSoup(requests.get(website, verify = False).text, 'lxml')
  # tiktok has to be initialised too, otherwise the return fails when no tiktok link is found
  fb, vk, ig, ok, youtube, tiktok = '', '', '', '', '', ''
  for a in page.find_all('a'):
    try:
      href = a['href']
      # if several links match the same network, the last one on the page wins
      if 'facebook.com/' in href:
        fb = strip_url(href)
      if 'vk.com/' in href:
        vk = strip_url(href)
      if 'instagram.com/' in href:
        ig = strip_url(href)
      if 'ok.ru/' in href:
        ok = strip_url(href)
      if 'youtube.com/channel/' in href:
        youtube = strip_url(href)
      if 'tiktok' in href:
        tiktok = strip_url(href)
    except KeyError: # <a> tag without an href attribute
      pass
  return {'url': website, 'fb': fb, 'vk': vk, 'ig': ig, 'ok': ok, 'youtube': youtube, 'tiktok': tiktok}
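
Substring checks such as `'ok.ru/' in href` also match unrelated domains that merely end in the same letters (for example a hypothetical `notebook.ru/`). A stricter variant compares the link's hostname instead. The sketch below uses only the standard library and works on an HTML string; `LinkCollector`, `social_links`, and the `NETWORKS` mapping are illustrative assumptions, and unlike the function above it accepts any `youtube.com` link, not only `/channel/` ones.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

NETWORKS = {'fb': 'facebook.com', 'vk': 'vk.com', 'ig': 'instagram.com',
            'ok': 'ok.ru', 'youtube': 'youtube.com', 'tiktok': 'tiktok.com'}

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

def social_links(html):
    collector = LinkCollector()
    collector.feed(html)
    found = {name: '' for name in NETWORKS}
    for href in collector.hrefs:
        host = urlsplit(href).hostname or ''  # relative links have no host
        for name, domain in NETWORKS.items():
            if host == domain or host.endswith('.' + domain):
                found[name] = href
    return found
```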

In [None]:
get_social_networks(websites_random[9]['website'])

{'fb': '',
 'ig': '',
 'ok': '',
 'url': 'http://окп-123.рф',
 'vk': '',
 'youtube': ''}

In [None]:
get_social_networks(websites_random[2]['website'])

{'fb': 'https://www.facebook.com/artnevagradspb/',
 'ig': 'https://www.instagram.com/artnevagrad/',
 'ok': '',
 'url': 'http://art-nevagrad.ru',
 'vk': '',
 'youtube': ''}

In [None]:
get_social_networks(websites_random[3]['website'])

{'fb': 'https://facebook.com/ourfutureru',
 'ig': 'https://www.instagram.com/ourslon',
 'ok': '',
 'url': 'http://ourfuture.ru',
 'vk': 'https://vk.com/ourfutureru',
 'youtube': ''}

# Looking up donation/help/bank account details on the index page

In [None]:
def donations_first_page(website):
  # lower-case the page text once and look for keyword stems in it
  text = BeautifulSoup(requests.get(website, verify = False).text, 'lxml').text.lower()
  account = 'реквизиты' in text  # "bank details"
  donation = 'пожертво' in text  # stem of "donation" / "to donate"
  help_page = 'помочь' in text   # "to help"
  return {'url': website, 'account': account, 'donation': donation, 'help_page' : help_page}

In [None]:
donations_first_page(websites_random[3]['website'])

{'account': False,
 'donation': False,
 'help_page': False,
 'url': 'http://ourfuture.ru'}

In [None]:
donations_first_page(websites_random[5]['website'])

{'account': False,
 'donation': True,
 'help_page': True,
 'url': 'http://rejdu.ru'}

# SSL-certificate

Requirements:
- valid
- not self-signed

Common errors found upon testing: 

- `ConnectionRefusedError / SSLError: [SSL: WRONG_VERSION_NUMBER]` - occurs if port 443 is not open on the server
- `SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate`
- `SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for %website%`
- `SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate`
- `SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired`

In [None]:
import ssl, socket, datetime, sys
from ssl import SSLCertVerificationError

In [None]:
def ssl_expiry_datetime(website, port = 443): # 443 is the default port for https connections
    host = website.replace('https://', '').replace('http://', '').strip('/')
    context = ssl.create_default_context()
    conn = context.wrap_socket(
        socket.socket(socket.AF_INET),
        server_hostname = host,
    )
    # 10 second timeout so that a slow server doesn't block the loop
    conn.settimeout(10)
    try:
      conn.connect((host, port))
      ssl_info = conn.getpeercert()
      ssl_info = {new : ssl_info[new] for new in ['issuer', 'notAfter', 'notBefore']} # only select relevant keys
      ssl_info['website'] = website
      ssl_info['error'] = 'No error'
      return ssl_info
    except (SSLCertVerificationError, ConnectionRefusedError) as e:
      return {'website' : website, 'error' : '%s : %s' % (type(e).__name__, e)}
    finally:
      conn.close() # release the socket whether or not the handshake succeeded


In [None]:
ssl_expiry_datetime(websites_random[0])

{'error': 'No error',
 'issuer': ((('countryName', 'US'),),
  (('organizationName', "Let's Encrypt"),),
  (('commonName', 'R3'),)),
 'notAfter': 'Feb  7 00:00:22 2022 GMT',
 'notBefore': 'Nov  9 00:00:23 2021 GMT',
 'website': 'http://avto-viraj.ru'}
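
The `notAfter` and `notBefore` values come back as strings. To compare them with the current date they can be parsed with `ssl.cert_time_to_seconds` from the standard library. The helpers below are a sketch and were not part of the original test; `cert_time` and `days_left` are hypothetical names.

```python
import ssl
from datetime import datetime, timezone

def cert_time(value):
    """Parse a certificate time string such as 'Feb  7 00:00:22 2022 GMT'."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)

def days_left(not_after, now=None):
    """Whole days until expiry; negative if the certificate has expired."""
    now = now or datetime.now(timezone.utc)
    return (cert_time(not_after) - now).days
```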

In [None]:
ssl_expiry_datetime(websites_random[4])

{'error': 'SSLCertVerificationError : [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1091)',
 'website': 'http://radchenko-ballet.com'}

In [None]:
ssl_expiry_datetime(websites_random[2])

{'error': 'ConnectionRefusedError : [Errno 111] Connection refused',
 'website': 'http://ruszhuravka.ru'}

In [None]:
ssl_expiry_datetime(websites_random[6])

{'error': "SSLCertVerificationError : [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'oppo-rnt.ru'. (_ssl.c:1091)",
 'website': 'http://oppo-rnt.ru'}

In [None]:
ssl_expiry_datetime(websites_random[8])

{'error': 'SSLCertVerificationError : [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)',
 'website': 'http://hrhi.ru'}

# Web Accessibility Evaluation

[Web Accessibility Evaluation Tools List](https://www.w3.org/WAI/ER/tools/)

[Techniques for WCAG 2.0](https://www.w3.org/TR/WCAG20-TECHS/general.html)

[`pa11y`](https://bitsofco.de/pa11y/), pronounced pally, is a set of free and open source tools that aims to make designing and developing accessibility easier. 

[WCAG Zoo](https://wcag-zoo.readthedocs.io/_/downloads/en/latest/pdf/) - Scripts for automated accessibility validation
- WCAG guideline index and validator [reference](https://wcag-zoo.readthedocs.io/en/latest/wcag.html)

- Are these metrics sufficient for us?
- Do we check only one (home) page?

In [None]:
!pip install wcag-zoo

Collecting wcag-zoo
  Downloading wcag-zoo-0.2.6.tar.gz (20 kB)
Collecting premailer
  Downloading premailer-3.10.0-py2.py3-none-any.whl (19 kB)
Collecting webcolors
  Downloading webcolors-1.11.1-py3-none-any.whl (9.9 kB)
Collecting xtermcolor
  Downloading xtermcolor-1.3.tar.gz (3.8 kB)
Collecting cssselect
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting cssutils
  Downloading cssutils-2.3.0-py3-none-any.whl (404 kB)
Building wheels for collected packages: wcag-zoo, xtermcolor
  Building wheel for wcag-zoo (setup.py) ... done
  Created wheel for wcag-zoo: filename=wcag_zoo-0.2.6-py2.py3-none-any.whl size=21405 sha256=30f27595f7d57831d510cc52e6e5dc501ec9bf0af594005585bea7469f11178a
  Stored in directory: /root/.cache/pip/wheels/c3/df/05/71dd1ba2a7ac600838e75d70a0ac07ec273f0954e5abfa0277
  Building wheel for xtermcolor (setup.py) ... done
  Created wheel for xtermcolor: filename=xt

In [None]:
import wcag_zoo
from wcag_zoo.validators.molerat import Molerat
from wcag_zoo.validators.tarsier import Tarsier
from wcag_zoo.validators.ayeaye import Ayeaye

In [None]:
html = BeautifulSoup(requests.get(websites_random[0]).text, 'lxml')

In [None]:
html.prettify()

'<!DOCTYPE html>\n<html lang="ru-RU">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <link href="https://gmpg.org/xfn/11" rel="profile"/>\n  <input id="_wpnonce" name="_wpnonce" type="hidden" value="f80f8a8ba7"/>\n  <input name="_wp_http_referer" type="hidden" value="/"/>\n  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>\n  <!-- This site is optimized with the Yoast SEO plugin v16.1.1 - https://yoast.com/wordpress/plugins/seo/ -->\n  <title>\n   Главная - Автошкола "Вираж"\n  </title>\n  <meta content=\'Автошкола "Вираж" ПОЛУЧИТЕ ПРАВА ЗА 2.5 МЕСЯЦАБЕЗ ЛИШНИХ РАСХОДОВ И ПЕРЕПЛАТ Записаться в автошколу О нас Наша автошкола успешно работает уже 10 лет. За\' name="description"/>\n  <link href="https://avto-viraj.ru/" rel="canonical"/>\n  <meta content="ru_RU" property="og:locale"/>\n  <meta content="website" property="og:type"/>\n  <meta content=\'Главная - Автошк

In [None]:
validator = Molerat(
    level="AA",
    media_rules=['max-width: 600px'],
    skip_these_classes=["sneaky"]
        ).validate_document(html.prettify().encode())

In [None]:
validator['failures']['1.4.3'].keys()

dict_keys(['G18', 'G145'])

In [None]:
validator.keys()



In [None]:
instance = Tarsier()
results = instance.validate_document(html.prettify().encode())

In [None]:
results['failures']['1.3.1'].keys()

dict_keys(['H42'])

In [None]:
results['failures']['1.3.1']['H42']

[{'classes': 'elementor-heading-title elementor-size-default',
  'error_code': 'tarsier-1',
  'guideline': '1.3.1',
  'id': None,
  'message': 'Incorrect header found at /html/body/div[1]/div/div/div/main/article/div/div/div/section[12]/div[2]/div/div/section[1]/div/div[1]/div/div[2]/div/h4 - H4 should be H3, text in header was \n                      Минимальные сроки обучения\n                     ',
  'technique': 'H42',
  'xpath': '/html/body/div[1]/div/div/div/main/article/div/div/div/section[12]/div[2]/div/div/section[1]/div/div[1]/div/div[2]/div/h4'}]
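
The H42 failure above reports a heading that skips a level (an `h4` where an `h3` was expected). The idea behind such a check can be illustrated with a few lines of standard-library code. This is a simplified sketch and not Tarsier's actual algorithm; `HeadingOrder` and `skipped_headings` are hypothetical names.

```python
from html.parser import HTMLParser

class HeadingOrder(HTMLParser):
    """Record headings that jump more than one level below the previous heading."""
    def __init__(self):
        super().__init__()
        self.previous = 0
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == 'h' and tag[1] in '123456':
            level = int(tag[1])
            if self.previous and level > self.previous + 1:
                # (found tag, expected tag)
                self.problems.append((tag, 'h%d' % (self.previous + 1)))
            self.previous = level

def skipped_headings(html):
    checker = HeadingOrder()
    checker.feed(html)
    return checker.problems
```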

In [None]:
instance = Ayeaye()
results = instance.validate_document(html.prettify().encode())

In [None]:
results

{'failures': {},
 'skipped': {},
 'success': {},
     'guideline': '2.1.1',
     'id': None,
     'message': 'No `accesskey` attributes found, consider adding some to improve keyboard accessibility',
     'technique': 'G202',
     'xpath': '/html/body'}]}}}

`wcag_zoo` was not used in the final version of the tests; the project switched to the Google Lighthouse API.