# Adding the NGOs we are researching to the web archive

The web archive (the Wayback Machine) takes a snapshot of a web page and stores it in its current state. We want to save every page of each NGO website as a snapshot (called a memento in the official docs: https://archive.readme.io/docs/creating-a-snapshot). Snapshots are stored indefinitely; however, if the official owner of a website formally asks, a snapshot can be deleted from the web archive, although I suspect that few people know about this.

First, we load the data.

### Data

In [1]:
import pandas as pd
import requests, json, os
from json.decoder import JSONDecodeError

In [2]:
data_url = "https://raw.githubusercontent.com/Teplitsa/CSRLab/main/data/2022_lab_index_report.csv"
index_report_data = pd.read_csv(data_url, encoding='utf-8')
index_report_data.tail()

Unnamed: 0,ogrn,website,bin_robots,bin_sitemap,bin_fb,bin_vk,bin_ig,bin_ok,bin_youtube,bin_tiktok,bin_donation,wcag_score,mean_page_speed,bin_title,bin_headings,bin_mob_friendly,bin_ssl,bin_socnet
9540,1217700385539,ngolikeyou.ru,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.884615,0.97,1.0,0.0,1.0,,0.666667
9541,1214600009524,kcmol.ru,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.8125,0.573333,1.0,1.0,1.0,0.0,0.333333
9542,1217700396330,gorneks.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.934783,0.78,1.0,0.0,,1.0,0.0
9543,1213300006920,ano-nachalo.ru,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.896552,0.996667,1.0,1.0,,0.0,0.0
9544,1212300052249,blagfond-zzh.ru,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.413333,1.0,1.0,,1.0,0.0


In [3]:
len(index_report_data)

9545

In [4]:
data_url = "https://raw.githubusercontent.com/Teplitsa/CSRLab/main/data/2022_lab_selected_ngos_data.csv"
selected_ngos_data = pd.read_csv(data_url, encoding='utf-8')
selected_ngos_data.head()

Unnamed: 0,ogrn,website,regionName,minjustForm,shortName,regDate,mainOkved,opfCombined,okvedCombined,vedSimple,websiteCreationDate
0,1026800000666,profobr68.ru,Тамбовская область,Профессиональный союз,ТАМБОВСКАЯ ОБЛАСТНАЯ ОРГАНИЗАЦИЯ ОБЩЕРОССИЙСКО...,1938-11-10,Деятельность профессиональных союзов,Профсоюзные организации,Деятельность профессиональных союзов,Профсоюзы,2016-01-15 19:31:34
1,1022900002861,проф-севмаш.рф,Архангельская область,Профессиональный союз,"МОО - ППО АО ""ПО ""СЕВМАШ"" СУДПРОФ",1942-03-15,Деятельность профессиональных союзов,Профсоюзные организации,Деятельность профессиональных союзов,Профсоюзы,2018-08-27 11:35:06
2,1025200008140,vdpor52.ru,Нижегородская область,Общественная организация,"НИЖЕГОРОДСКОЕ ОБЛАСТНОЕ ОТДЕЛЕНИЕ ВДПО, ВДПО Н...",1958-06-20,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2016-09-28 12:05:06
3,1037739471220,orthonet.ru,Москва,Общественная организация,СОЮЗ ПИСАТЕЛЕЙ РОССИИ,1958-12-07,Деятельность профессиональных членских организ...,Общественные организации,Деятельность профессиональных членских организ...,Профсоюзы,2005-11-24 21:00:00
4,1085200004450,vdpo-sarov.ru,Нижегородская область,Общественная организация,"САРОВСКОЕ ""ВДПО""",1962-03-30,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2021-04-15 8:10:11


In [5]:
len(selected_ngos_data)

9545

It is the same dataset, just with different columns. It is a bit strange that it is split into two dataframes, so we merge them into one on two columns: *website* and *ogrn*.

In [6]:
concat_ngos = pd.merge(
    index_report_data,
    selected_ngos_data,
    how="outer",
    on=["ogrn", "website"],
    sort=False,
    left_index=False,
    right_index=False,
)

In [7]:
concat_ngos.head()

Unnamed: 0,ogrn,website,bin_robots,bin_sitemap,bin_fb,bin_vk,bin_ig,bin_ok,bin_youtube,bin_tiktok,...,bin_socnet,regionName,minjustForm,shortName,regDate,mainOkved,opfCombined,okvedCombined,vedSimple,websiteCreationDate
0,1026800000666,profobr68.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,Тамбовская область,Профессиональный союз,ТАМБОВСКАЯ ОБЛАСТНАЯ ОРГАНИЗАЦИЯ ОБЩЕРОССИЙСКО...,1938-11-10,Деятельность профессиональных союзов,Профсоюзные организации,Деятельность профессиональных союзов,Профсоюзы,2016-01-15 19:31:34
1,1022900002861,проф-севмаш.рф,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.166667,Архангельская область,Профессиональный союз,"МОО - ППО АО ""ПО ""СЕВМАШ"" СУДПРОФ",1942-03-15,Деятельность профессиональных союзов,Профсоюзные организации,Деятельность профессиональных союзов,Профсоюзы,2018-08-27 11:35:06
2,1025200008140,vdpor52.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,Нижегородская область,Общественная организация,"НИЖЕГОРОДСКОЕ ОБЛАСТНОЕ ОТДЕЛЕНИЕ ВДПО, ВДПО Н...",1958-06-20,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2016-09-28 12:05:06
3,1037739471220,orthonet.ru,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,Москва,Общественная организация,СОЮЗ ПИСАТЕЛЕЙ РОССИИ,1958-12-07,Деятельность профессиональных членских организ...,Общественные организации,Деятельность профессиональных членских организ...,Профсоюзы,2005-11-24 21:00:00
4,1085200004450,vdpo-sarov.ru,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.166667,Нижегородская область,Общественная организация,"САРОВСКОЕ ""ВДПО""",1962-03-30,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2021-04-15 8:10:11


In [8]:
len(concat_ngos)

9545

The aim is to add a column with links to web archive snapshots to the `concat_ngos` dataset.

Steps:
1. For every website in our dataset, check whether it is available in the web archive using the Wayback Availability JSON API.
2. If the page is not archived, make a snapshot now. If it is archived, check whether the latest snapshot was made within the last 6 months: if yes, take the link to that snapshot; if no, create a new snapshot.
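The decision in step 2 can be sketched as a small pure function. This is only an illustration: the names `months_between`, `needs_new_snapshot` and `MAX_AGE_MONTHS` are made up for this example, and the month arithmetic is a simplified stand-in for the `relativedelta` call used later in the notebook.

```python
from datetime import date

MAX_AGE_MONTHS = 6  # threshold from step 2 (assumed constant name)


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed between two dates."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not complete yet
    return max(months, 0)


def needs_new_snapshot(last_timestamp, today: date) -> bool:
    """Return True when a page should be archived again.

    `last_timestamp` is a Wayback-style 'YYYYMMDDhhmmss' string,
    or None when the page has never been archived.
    """
    if last_timestamp is None:
        return True
    snap_date = date(int(last_timestamp[:4]),    # year
                     int(last_timestamp[4:6]),   # month
                     int(last_timestamp[6:8]))   # day
    return months_between(snap_date, today) > MAX_AGE_MONTHS
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the check reproducible when the notebook is rerun later.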

### Archive how to

`waybackpy` is for creating snapshots:

In [9]:
!pip install waybackpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting waybackpy
  Downloading waybackpy-3.0.6-py3-none-any.whl (34 kB)
Installing collected packages: waybackpy
Successfully installed waybackpy-3.0.6


In [None]:
""" Uncomment this if you want to test how the API works
# After running this code, https://web.archive.org/ will show that a new
# snapshot was created today, attributed to an unknown collection

from waybackpy import WaybackMachineSaveAPI

url = "ngolikeyou.ru"
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"

save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.save()
save_api.cached_save
"""

'https://web.archive.org/web/20220621053655/https://ngolikeyou.ru/'

Since the web archive creates snapshots of a web page, not of a whole website, we need to get the sitemap of each website and collect all of its pages.

There is a library for that:

In [10]:
!pip install ultimate-sitemap-parser

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ultimate-sitemap-parser
  Downloading ultimate_sitemap_parser-0.5-py2.py3-none-any.whl (23 kB)
Installing collected packages: ultimate-sitemap-parser
Successfully installed ultimate-sitemap-parser-0.5
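To make it clearer what the sitemap parser extracts, here is a minimal sketch that reads page URLs from a single sitemap file using only the standard library. It is not how `ultimate-sitemap-parser` works internally (that library also follows `robots.txt` and nested sitemaps); the sitemap content and the `page_urls` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, made up for this example
EXAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/news</loc></url>
  <url><loc>https://cdn.example.net/file.pdf</loc></url>
</urlset>"""


def page_urls(sitemap_xml: str, homepage: str) -> list:
    """Return the <loc> URLs of a sitemap that belong to `homepage`."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locs = [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]
    # keep only pages of the site itself, as the notebook does later
    return [url for url in locs if url.startswith(homepage)]


print(page_urls(EXAMPLE_SITEMAP, "https://example.org/"))
```

Each URL returned this way is one page that needs its own snapshot.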


Checking for snapshots is faster with command line GET requests than with the waybackpy library (about 10 times faster):

In [11]:
# test example
availability = !curl -X GET "https://archive.org/wayback/available?url=http://vdpo-sarov.ru/"
json.loads(availability[0])['archived_snapshots']['closest']['timestamp']

'20220623103945'
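The same lookup can be written in plain Python, which may be handy outside a notebook where the `!curl` shell syntax is not available. This is a sketch under assumptions: `closest_snapshot` and `fetch_availability` are hypothetical helper names, the sample response only imitates the shape of the JSON parsed above, and no claim is made here about how its speed compares to `curl`.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(response: dict):
    """Return the 'closest' snapshot dict, or None if nothing is archived."""
    return response.get("archived_snapshots", {}).get("closest")


def fetch_availability(url: str) -> dict:
    """Query the availability API from Python (needs network access)."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Illustrative response with the same structure as the one parsed above
sample = {
    "url": "http://vdpo-sarov.ru/",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20220623103945/http://vdpo-sarov.ru/",
            "timestamp": "20220623103945",
        }
    },
}

snapshot = closest_snapshot(sample)
print(snapshot["timestamp"] if snapshot else "not archived")
```

Using `.get()` means a page without snapshots yields `None` instead of a `KeyError`.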

One might think that making snapshots would also be faster this way, but unfortunately this method is disabled: the corresponding server is down and returns a 502 error.

Test:

In [12]:
!curl -X POST -H "Content-Type: application/json" -d '{"url": "http://vdpo-sarov.ru/tovari", "annotation": {"id": "ngos-archive", "message": ""}}' https://pragma.archivelab.org

<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.4.6 (Ubuntu)</center>
</body>
</html>


Another method is the "Save Page Now" button, used through the waybackpy library. It allows no more than 15 requests per minute; beyond that it blocks the user for 5 minutes (the Wayback Machine introduced this rule in 2019).
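One way to stay under such a limit is to enforce a minimum interval between requests. The class below is a hypothetical helper, not part of waybackpy: 15 requests per minute translates into at least 4 seconds between calls, and the clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time


class MinIntervalLimiter:
    """Allow at most `max_per_minute` calls by spacing them evenly."""

    def __init__(self, max_per_minute: int = 15,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # 4 seconds for 15/min
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # remember when this request is (about to be) sent
        self._last_call = now + delay
        return delay
```

Calling `limiter.wait()` right before each save request would replace a fixed `sleep(...)`: slow requests then need no extra pause, while fast ones are delayed just enough.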

In [None]:
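# test
from waybackpy import WaybackMachineSaveAPI

user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"

def save_url(url):
  # save one page with "Save Page Now" and return the snapshot URL
  save_api = WaybackMachineSaveAPI(url, user_agent)
  snapshot = save_api.save()
  return snapshot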

In [None]:
# test
from tqdm import tqdm
from random import randint
from time import sleep

# `urls` is the list of page URLs of one website (here the 43 pages of ngolikeyou.ru)
for url in tqdm(urls):
  sleep(randint(10,15))
  print(save_url(url))

  2%|▏         | 1/43 [00:21<14:58, 21.39s/it]

https://web.archive.org/web/20220621053655/https://ngolikeyou.ru/


  5%|▍         | 2/43 [00:42<14:23, 21.06s/it]

https://web.archive.org/web/20220621055932/https://ngolikeyou.ru/trening


  7%|▋         | 3/43 [01:02<13:57, 20.93s/it]

https://web.archive.org/web/20220621060001/https://ngolikeyou.ru/groups


  9%|▉         | 4/43 [01:25<13:54, 21.41s/it]

https://web.archive.org/web/20220621060023/https://ngolikeyou.ru/psy


 12%|█▏        | 5/43 [01:51<14:40, 23.16s/it]

https://web.archive.org/web/20220621060057/https://ngolikeyou.ru/donate


 14%|█▍        | 6/43 [02:13<14:05, 22.84s/it]

https://web.archive.org/web/20220621060129/https://ngolikeyou.ru/media


 16%|█▋        | 7/43 [02:34<13:21, 22.27s/it]

https://web.archive.org/web/20220621060201/https://ngolikeyou.ru/annual_report


 19%|█▊        | 8/43 [02:57<13:08, 22.53s/it]

https://web.archive.org/web/20220621060227/https://ngolikeyou.ru/news


 21%|██        | 9/43 [03:22<13:13, 23.33s/it]

https://web.archive.org/web/20220621065613/https://ngolikeyou.ru/tpost/tpabkefri1-otchyot-o-deyatelnosti-v-2021-godu


 23%|██▎       | 10/43 [03:57<14:48, 26.91s/it]

https://web.archive.org/web/20220621065649/https://ngolikeyou.ru/tpost/llgg6njkt1-pryamoi-efir-vsyo-o-bipolyarnom-rasstroi


 26%|██▌       | 11/43 [04:21<13:47, 25.85s/it]

https://web.archive.org/web/20220621065712/https://ngolikeyou.ru/tpost/lddlsrfuj1-pryamoi-efir-mifi-v-farmakoterapii-bipol


 28%|██▊       | 12/43 [05:49<23:10, 44.85s/it]

https://web.archive.org/web/20220621065841/https://ngolikeyou.ru/tpost/rcn3nghfp1-pryamoi-efir-kak-naiti-spetsialista-psih


 30%|███       | 13/43 [06:15<19:31, 39.07s/it]

https://web.archive.org/web/20220621065905/https://ngolikeyou.ru/tpost/7keu99jd91-pryamoi-efir-subdepressiya-chto-delat


 33%|███▎      | 14/43 [07:02<20:06, 41.62s/it]

https://web.archive.org/web/20220621065955/https://ngolikeyou.ru/tpost/syln3by991-pryamoi-efir-rpp-i-psihicheskie-zaboleva


 35%|███▍      | 15/43 [07:33<17:50, 38.23s/it]

https://web.archive.org/web/20220621070024/https://ngolikeyou.ru/tpost/5m2pge30l1-pryamoi-efir-suitsid-kak-pomoch-sebe-i-b


 37%|███▋      | 16/43 [08:01<15:51, 35.24s/it]

https://web.archive.org/web/20220621070057/https://ngolikeyou.ru/tpost/sgu42gig11-mi-zapustili-gruppu-podderzhki-dlya-lyud


 40%|███▉      | 17/43 [08:53<17:30, 40.39s/it]

https://web.archive.org/web/20220621070146/https://ngolikeyou.ru/tpost/4258bj5lp1-otkrilas-gruppa-podderzhki-dlya-lyudei-s


 42%|████▏     | 18/43 [09:33<16:41, 40.06s/it]

https://web.archive.org/web/20220621070225/https://ngolikeyou.ru/tpost/oki2yenjg1-startoval-nabor-na-trening-navikov-regul


 44%|████▍     | 19/43 [10:02<14:47, 36.96s/it]

https://web.archive.org/web/20220621070252/https://ngolikeyou.ru/tpost/ehnha01ob1-pryamoi-efir-psihicheskoe-rasstroistvo-i


 47%|████▋     | 20/43 [12:01<23:33, 61.46s/it]

https://web.archive.org/web/20220621070455/https://ngolikeyou.ru/tpost/lg4637njr1-v-novosibirske-otkrilas-gruppa-ravnoi-po


 49%|████▉     | 21/43 [12:55<21:44, 59.28s/it]

https://web.archive.org/web/20220621070548/https://ngolikeyou.ru/tpost/rvimp2itx1-zanyatie-po-art-terapii-dlya-vseh-zhelay


 51%|█████     | 22/43 [13:20<17:05, 48.83s/it]

https://web.archive.org/web/20220621070611/https://ngolikeyou.ru/tpost/f6t8c19281-mi-otkrili-ano-kak-ti


 53%|█████▎    | 23/43 [14:00<15:24, 46.25s/it]

https://web.archive.org/web/20220621070656/https://ngolikeyou.ru/tpost/iklx2zaep1-1000-podpischikov-na-youtube


 56%|█████▌    | 24/43 [14:52<15:14, 48.15s/it]

https://web.archive.org/web/20220621070746/https://ngolikeyou.ru/tpost/y9773x0st1-gruppa-podderzhki-dlya-lyudei-s-bar-otkr


 58%|█████▊    | 25/43 [15:40<14:25, 48.08s/it]

https://web.archive.org/web/20220621070831/https://ngolikeyou.ru/tpost/rags4jcdh1-pryamoi-efir-gipomaniya-chto-eto


 60%|██████    | 26/43 [16:00<11:10, 39.44s/it]

https://web.archive.org/web/20220621070857/https://ngolikeyou.ru/tpost/2ae2dhvbj1-kruglii-stol-bipolyarnoe-rasstroistvo-vr


 63%|██████▎   | 27/43 [17:17<13:34, 50.91s/it]

https://web.archive.org/web/20220621071011/https://ngolikeyou.ru/tpost/h42vvn8gl1-vishlo-metodicheskoe-posobie-po-sozdaniy


 65%|██████▌   | 28/43 [17:43<10:50, 43.39s/it]

https://web.archive.org/web/20220621071035/https://ngolikeyou.ru/tpost/520nf0lo71-efiri-o-beremennosti-i-materinstve-pri-p


 67%|██████▋   | 29/43 [18:53<11:59, 51.37s/it]

https://web.archive.org/web/20220621071147/https://ngolikeyou.ru/tpost/d6jb7l0u81-pryamoi-efir-kak-bit-s-blizkim


 70%|██████▉   | 30/43 [19:51<11:32, 53.24s/it]

https://web.archive.org/web/20220621071246/https://ngolikeyou.ru/tpost/ytyx2bao81-efir-pravda-i-mifi-ob-invalidnosti-pri-p


 72%|███████▏  | 31/43 [20:16<08:57, 44.82s/it]

https://web.archive.org/web/20220621071308/https://ngolikeyou.ru/tpost/mm4gsoba91-efir-pravda-i-mifi-ob-uchyote


 74%|███████▍  | 32/43 [21:27<09:40, 52.74s/it]

https://web.archive.org/web/20220621071415/https://ngolikeyou.ru/tpost/0cvd8660y1-kruglii-stol-psihicheskoe-zdorove-v-epoh


 77%|███████▋  | 33/43 [21:55<07:32, 45.25s/it]

https://web.archive.org/web/20220621071447/https://ngolikeyou.ru/tpost/1c29z5ja81-alyona-shibarshina-prinyala-uchastie-v-v


 79%|███████▉  | 34/43 [22:52<07:19, 48.82s/it]

https://web.archive.org/web/20220621071547/https://ngolikeyou.ru/tpost/v9jnix7x91-otkrit-nabor-v-gruppu-po-art-terapii-v-m


 81%|████████▏ | 35/43 [23:38<06:24, 48.02s/it]

https://web.archive.org/web/20220621071627/https://ngolikeyou.ru/tpost/k69zx508a1-efir-kak-spravitsya-s-emotsiyami-kogda-k


 84%|████████▎ | 36/43 [25:09<07:06, 60.96s/it]

https://web.archive.org/web/20220621071749/https://ngolikeyou.ru/tpost/8f19i77gl1-otkritoe-zanyatie-po-art-terapii-ob-emot


 86%|████████▌ | 37/43 [25:35<05:02, 50.43s/it]

https://web.archive.org/web/20220621071827/https://ngolikeyou.ru/tpost/leyn8loks1-literaturnii-master-klass-v-ekaterinburg


 88%|████████▊ | 38/43 [25:55<03:25, 41.08s/it]

https://web.archive.org/web/20220621071852/https://ngolikeyou.ru/tpost/6vfu8sd6z1-art-terapiya-v-ekaterinburge


 91%|█████████ | 39/43 [26:17<02:22, 35.54s/it]

https://web.archive.org/web/20220621071914/https://ngolikeyou.ru/tpost/gxiu1r1g51-seriya-strimov-vstrechi-s-psihologom


 93%|█████████▎| 40/43 [26:42<01:37, 32.45s/it]

https://web.archive.org/web/20220621071935/https://ngolikeyou.ru/tpost/iil8b4f6x1-kak-mi-risovali-mukoi-i-vodoi


 95%|█████████▌| 41/43 [27:07<01:00, 30.05s/it]

https://web.archive.org/web/20220621071957/https://ngolikeyou.ru/tpost/30dfzgjdu1-otkrit-nabor-na-trening-dlya-veduschih-g


 98%|█████████▊| 42/43 [27:43<00:31, 31.98s/it]

https://web.archive.org/web/20220621072034/https://ngolikeyou.ru/tpost/lb4eta0cl1-zavershilsya-ocherednoi-trening-dlya-ved


100%|██████████| 43/43 [28:11<00:00, 39.33s/it]

https://web.archive.org/web/20220621072103/https://ngolikeyou.ru/tpost/fcsuhbjug1-zakonchilsya-pervii-kurs-po-art-terapii





## Archive all

In [63]:
from waybackpy import WaybackMachineSaveAPI
from tqdm import tqdm
from random import randint
from time import sleep
from usp.tree import sitemap_tree_for_homepage
from usp.exceptions import SitemapException
from datetime import date
from dateutil import relativedelta
import re



def get_urls_from_sitemap(original_url: str) -> list:
  tree = sitemap_tree_for_homepage(original_url)
  urls = [page.url for page in tree.all_pages() if page.url.startswith(original_url)]
  return urls


def how_many_months_ago(snap_timestamp: str) -> int:
  # `snap_timestamp` looks kind of like this '20220323172342'
  current = date.today()
  snap_date = date(int(snap_timestamp[:4]),  # year
                   int(snap_timestamp[4:6]),  # month
                   int(snap_timestamp[6:8]))  # day

  r = relativedelta.relativedelta(current, snap_date)
  months_difference = (r.years * 12) + r.months
  return abs(months_difference)


def find_existing_snapshot(original_url: str) -> str:
  # check if the url is available on wayback machine in web archive
  availability = !curl -X GET {"https://archive.org/wayback/available?url=" + original_url}

  try:
    response = json.loads(availability[0])
  except JSONDecodeError:
    print(availability)
    return None

  if response["archived_snapshots"] != {}:
    snapshot = response["archived_snapshots"]["closest"]
    snap_timestamp = snapshot["timestamp"]
    # the last snapshot is made in the last half a year
    if how_many_months_ago(snap_timestamp) <= 6:
      return snapshot["url"]
    else:  # if the last snapshot is older than half a year it is not good enough for us
      return None
  else:  # case when there are no snapshots at all
    return None


def save_page_now(url: str) -> str:
  user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
  save_api = WaybackMachineSaveAPI(url, user_agent)
  snapshot = save_api.save()
  sleep(4.5)
  return snapshot


def create_snaps(urls: list) -> list:
  snaps = []
  for url in tqdm(urls):
    web_archive_snap = find_existing_snapshot(url)
    if not web_archive_snap:
      new_snap = save_page_now(url)
      snaps.append(new_snap)
    else:
      snaps.append(web_archive_snap)
  return snaps


def has_cyrillic(text):
  # 'ё' and 'Ё' lie outside the а-я / А-Я ranges, so they are listed explicitly
  return bool(re.search('[а-яА-ЯёЁ]', text))


def get_all_urls(entry, original_url: str) -> list:
  # use https only when the dataset marks the site as having SSL
  scheme = "https://" if entry.bin_ssl == 1.0 else "http://"
  orig_url = scheme + original_url + "/"
  urls = get_urls_from_sitemap(orig_url)
  if not urls:  # if there is no sitemap
    urls = [orig_url]
  return urls

Now we can create a column where we store the snapshot URLs. We need to process the data in batches, because we want to save the results to a CSV file from time to time.
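A minimal sketch of the batching idea: split the row positions into consecutive ranges and write a checkpoint after each one. `batch_ranges` is a hypothetical helper, and the batch size of 50 and the checkpoint file name in the comments are assumptions, not values fixed by this notebook.

```python
def batch_ranges(total: int, batch_size: int):
    """Yield (start, stop) position pairs that cover range(total) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


# Intended use with the dataframe from this notebook (not executed here):
# for start, stop in batch_ranges(len(concat_ngos), 50):
#     ...archive the rows concat_ngos.iloc[start:stop]...
#     concat_ngos.iloc[:stop].to_csv("checkpoint.csv", encoding="utf-8")

print(list(batch_ranges(120, 50)))
```

If a run is interrupted, only the batch in progress is lost, and the loop can be restarted from the last saved `stop` position.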

In [14]:
# empty placeholder; each cell later receives a list of snapshot URLs
concat_ngos["webarchive_snapshots"] = ""

In [None]:
# test
# Can a site have no SSL but still have a sitemap?
concat_ngos[(concat_ngos["bin_ssl"] != 1.0) & (concat_ngos["bin_sitemap"] == 1.0)]

Unnamed: 0,ogrn,website,bin_robots,bin_sitemap,bin_fb,bin_vk,bin_ig,bin_ok,bin_youtube,bin_tiktok,...,regionName,minjustForm,shortName,regDate,mainOkved,opfCombined,okvedCombined,vedSimple,websiteCreationDate,webarchive_snapshots
0,1026800000666,profobr68.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Тамбовская область,Профессиональный союз,ТАМБОВСКАЯ ОБЛАСТНАЯ ОРГАНИЗАЦИЯ ОБЩЕРОССИЙСКО...,1938-11-10,Деятельность профессиональных союзов,Профсоюзные организации,Деятельность профессиональных союзов,Профсоюзы,2016-01-15 19:31:34,
2,1025200008140,vdpor52.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Нижегородская область,Общественная организация,"НИЖЕГОРОДСКОЕ ОБЛАСТНОЕ ОТДЕЛЕНИЕ ВДПО, ВДПО Н...",1958-06-20,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2016-09-28 12:05:06,
4,1085200004450,vdpo-sarov.ru,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,Нижегородская область,Общественная организация,"САРОВСКОЕ ""ВДПО""",1962-03-30,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2021-04-15 8:10:11,
10,1027700270861,www.ujmos.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Москва,Общественная организация,"РОО ""СЖМ"", СЖМ",1977-04-23,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2006-09-13 20:00:00,
13,1033918500285,chessfed39.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Калининградская область,Общественная организация,"КООО ""ШАХМАТНАЯ ФЕДЕРАЦИЯ""",1984-02-16,"Деятельность прочих общественных организаций, ...",Общественные организации,"Деятельность прочих общественных организаций, ...",Общественные организации,2019-12-05 13:43:18,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9500,1218600002840,gran-i.ru,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Ханты-Мансийский автономный округ - Югра,Автономная некоммерческая организация,АНО «ГРАНИ»,2021-03-16,Деятельность библиотек и архивов,Автономные некоммерческие организации,Деятельность библиотек и архивов,Издательство,2021-07-01 7:25:53,
9505,1211800004998,www.ushinsky-iro.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Удмуртская республика,Автономная некоммерческая организация,ИНСТИТУТ РАЗВИТИЯ ОБРАЗОВАНИЯ ИМ. К. Д. УШИНСКОГО,2021-03-18,Образование профессиональное дополнительное,Автономные некоммерческие организации,Образование профессиональное дополнительное,Образование,2021-03-07 10:18:03,
9506,1217800042460,innovationcentre.ru,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Санкт-Петербург,Автономная некоммерческая организация,"АНО ЦЕНТР РАЗВИТИЯ ""ИННОВАЦИЯ""",2021-03-19,Предоставление прочих социальных услуг без обе...,Автономные некоммерческие организации,Предоставление прочих социальных услуг без обе...,Социальная работа,2021-05-23 18:25:03,
9527,1216600028897,прознания.рф,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,Свердловская область,Автономная некоммерческая организация,"АНО ДПО ""ПРОЗНАНИЯ""",2021-05-13,Образование профессиональное дополнительное,Автономные некоммерческие организации,Образование профессиональное дополнительное,Образование,2021-09-20 8:21:31,


In [64]:
i = 0  # number of rows processed so far
for entry in concat_ngos.iloc[:50].itertuples():
  original_url = entry.website
  if has_cyrillic(original_url):
    # convert an internationalized domain name to its ASCII (punycode) form
    original_url = original_url.encode('idna').decode('utf-8')

  urls = get_all_urls(entry, original_url)
  snaps = create_snaps(urls)

  # save web archive links to the dataset, using the row's own index label
  concat_ngos.at[entry.Index, "webarchive_snapshots"] = snaps

  i += 1


2022-06-23 21:21:04,774 INFO usp.fetch_parse [61/MainThread]: Fetching level 0 sitemap from http://profobr68.ru/robots.txt...
2022-06-23 21:21:04,777 INFO usp.helpers [61/MainThread]: Fetching URL http://profobr68.ru/robots.txt...
2022-06-23 21:21:05,262 INFO usp.fetch_parse [61/MainThread]: Parsing sitemap from URL http://profobr68.ru/robots.txt...
2022-06-23 21:21:05,266 INFO usp.fetch_parse [61/MainThread]: Fetching level 0 sitemap from http://profobr68.ru/sitemap.xml...
2022-06-23 21:21:05,269 INFO usp.helpers [61/MainThread]: Fetching URL http://profobr68.ru/sitemap.xml...
2022-06-23 21:21:05,911 INFO usp.fetch_parse [61/MainThread]: Parsing sitemap from URL http://profobr68.ru/sitemap.xml...
2022-06-23 21:21:05,914 INFO usp.fetch_parse [61/MainThread]: Fetching level 1 sitemap from http://profobr68.ru/post-sitemap.xml...
2022-06-23 21:21:05,918 INFO usp.helpers [61/MainThread]: Fetching URL http://profobr68.ru/post-sitemap.xml...
2022-06-23 21:21:08,046 INFO usp.fetch_parse [61/M

TooManyRequestsError: ignored

Notes:

- 'https://prgura.ru/' throws a TooManyRequests error although it is literally just one page; its index is 60. It seems that the domain name has changed and the site now redirects somewhere else (???); maybe this is the reason.
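A possible way to keep a long run alive when such an error appears is to retry the call after a pause. The helper below is a generic sketch and not part of waybackpy: `call_with_backoff` is a made-up name, the exception types to retry on are passed in by the caller, and the 300-second base delay is an assumption derived from the 5-minute block mentioned earlier.

```python
import time


def call_with_backoff(func, *args, retry_on=(Exception,), max_attempts=4,
                      base_delay=300.0, sleep=time.sleep):
    """Call func(*args); on a listed error wait and retry, doubling the delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))
```

In this notebook it could wrap the save call, for example `call_with_backoff(save_page_now, url, retry_on=(TooManyRequestsError,))`, assuming the exception class is imported from wherever the installed waybackpy version defines it. Pages that keep failing, like the one in the note above, would still raise after the last attempt and should be recorded and skipped.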

In [51]:
i

0

In [None]:
a = "string"
a[2:]

'ring'

In [45]:
concat_ngos.iloc[13].webarchive_snapshots

['http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http://chessfed39.ru/',
 'http://web.archive.org/web/20220324190836/http

### Save the data

In [None]:
concat_ngos.iloc[:50].to_csv("ngos_archived_99.csv", encoding="utf-8")

I checked ngolikeyou.ru by hand. It first appears in the web archive less than six months after the site was created. In 2022 it shows up more often: it is picked up by some automatic crawlers of the web archive and by automatic crawlers of something called russian-web.

In [None]:
selected_ngos_data[selected_ngos_data['website'] == 'ngolikeyou.ru']

Unnamed: 0,ogrn,website,regionName,minjustForm,shortName,regDate,mainOkved,opfCombined,okvedCombined,vedSimple,websiteCreationDate
9540,1217700385539,ngolikeyou.ru,Москва,,АНО ЦЕНТР СИСТЕМНОЙ ПОДДЕРЖКИ ДЛЯ ЛЮДЕЙ С ПСИХ...,2021-08-18,Предоставление прочих социальных услуг без обе...,Автономные некоммерческие организации,Автономные некоммерческие организации,АНО,2021-08-17 15:54:57


In [None]:
!ia configure



In [None]:
import internetarchive as ia

ia.download('ngolikeyou.ru')


AttributeError: ignored

In [None]:
item = ia.get_item('lettertowilliaml00doug')
item.download()

AttributeError: ignored

In [None]:
!sudo pip install pymarc

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymarc
  Downloading pymarc-4.2.0.tar.gz (230 kB)
Building wheels for collected packages: pymarc
  Building wheel for pymarc (setup.py) ... done
  Created wheel for pymarc: filename=pymarc-4.2.0-py3-none-any.whl size=155332 sha256=3923c947e5ec024bfe261453b3b2b0359af9359337d282581cf02197acbdb9bc
  Stored in directory: /root/.cache/pip/wheels/35/c1/bc/7cbc19ab89d8fea276e17106de1231e299f0bbe73135610515
Successfully built pymarc
Installing collected packages: pymarc
Successfully installed pymarc-4.2.0


In [None]:
from waybackpy import WaybackMachineSaveAPI

url = "https://github.com"
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"

save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.save()

'https://web.archive.org/web/20220618064446/https://github.com/'

In [None]:
save_api.cached_save

True

In [None]:
save_api.timestamp()

datetime.datetime(2022, 6, 18, 6, 44, 46)

In [None]:
import requests
import json
import re
from fake_useragent import UserAgent
from datetime import datetime

In [None]:
url = "ngolikeyou.ru"
ua = UserAgent()
headers = {"user-agent": ua.chrome}

timestamp = datetime.now().timestamp()
#wburl = "https://archive.org/wayback/available?url=" + url + "&timestamp=" + str(timestamp)

In [None]:
data

{'archived_snapshots': {},
 'timestamp': '1655541196.058648',
 'url': 'ngolikeyou.ru'}
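The response above echoes back the epoch value that was sent as `timestamp`. As far as the availability API documentation describes it, this parameter is expected as 1 to 14 digits in `YYYYMMDDhhmmss` form, so a sketch of building the request URL could look as follows (`availability_url` is a hypothetical helper name):

```python
from datetime import datetime
from urllib.parse import urlencode


def availability_url(page_url: str, when: datetime) -> str:
    """Build an availability query with a YYYYMMDDhhmmss timestamp."""
    params = {
        "url": page_url,
        "timestamp": when.strftime("%Y%m%d%H%M%S"),
    }
    # urlencode also escapes characters such as ':' and '/' in page_url
    return "https://archive.org/wayback/available?" + urlencode(params)


print(availability_url("ngolikeyou.ru", datetime(2022, 6, 18, 8, 33, 16)))
```

Whether the empty `archived_snapshots` above was caused by the timestamp format or by the page simply not being archived yet cannot be told from this output alone.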

In [None]:
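# `data` is already a dict, so json.loads(data) would raise a TypeError
geturl = data

try:
    wayback = geturl['archived_snapshots']['closest']['url']
except KeyError:
    # no 'closest' entry means the page has no snapshot yet
    wayback = "n/a"
    print("No snapshot URL returned")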