# VKontakte crawling for NGOs

For each NGO that has a VK (VKontakte) page, we want to find out:

- the number of subscribers
- how often they post (posts per month, for starters)
- the number of likes on each post (then take the mean)
- the number of comments on each post (then take the mean)

> Day 1: collect the organizations' VK links and prepare the dataset. 3 days: write a database to hold all the VK information; JSON would be inconvenient because each organization essentially has its own related table of social networks. The same database can later be reused for other social networks. Data collection: 2-3 days. Total: 6-7 days.
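
The storage idea in the plan above could look roughly like this with the standard-library `sqlite3` module. This is only a sketch: the table and column names (`organizations`, `vk_stats`, and so on) and both helper functions are hypothetical and are not taken from the project.

```python
import sqlite3

# Hypothetical schema: one row per NGO, plus one stats table per social network.
SCHEMA = """
CREATE TABLE IF NOT EXISTS organizations (
    id      INTEGER PRIMARY KEY,
    website TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS vk_stats (
    org_id          INTEGER NOT NULL REFERENCES organizations(id),
    vk_url          TEXT NOT NULL,
    members         INTEGER,
    posts_per_month REAL,
    likes_mean      REAL,
    comments_mean   REAL
);
"""


def init_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_vk_stats(conn, website, vk_url, members=None, posts_per_month=None,
                 likes_mean=None, comments_mean=None):
    """Store one VK measurement for an NGO, creating the NGO row if needed."""
    conn.execute("INSERT OR IGNORE INTO organizations (website) VALUES (?)", (website,))
    org_id = conn.execute(
        "SELECT id FROM organizations WHERE website = ?", (website,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO vk_stats VALUES (?, ?, ?, ?, ?, ?)",
        (org_id, vk_url, members, posts_per_month, likes_mean, comments_mean),
    )
    conn.commit()
    return org_id
```

A table for another network (for example a hypothetical `ok_stats`) could reference the same `organizations.id`, which is the reuse the plan has in mind.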

## Load the dataset

In [1]:
import pandas as pd
import numpy as np
import re

In [3]:
!pip install vk_api


In [3]:
!ls

2021 - Lab - Google PageSpeed Insights Lighthouse Tests.ipynb
2021 - Lab - whois check.ipynb
2021 - OpenNGO dumps filters.ipynb
2021_Lab_Tests_for_websites_communicative_capacity.ipynb
2022_ngos_hostings.ipynb
2022_ngos_vk.ipynb
2022_web_archive_ngos.ipynb
util.py


In [11]:
ngo_vks = pd.read_csv("../../vk_links.csv", encoding="utf-8")
ngo_vks.head()

Unnamed: 0,vk
0,\nhttps://vk.com/bsdipol_manager\n
1,//vk.com/avtoshkola_adpo
2,//vk.com/bfdetizemli
3,//vk.com/bryanskeparhia
4,//vk.com/budros


In [5]:
len(ngo_vks)

4522

In [12]:
all_socials = pd.read_csv("../data/2021_dec_social_networks_check.csv", encoding="utf-8")
all_socials.head()

Unnamed: 0,url,fb,vk,ig,ok,youtube,tiktok
0,http://burningheart-charity.ru,,,,,,
1,http://nkosocium.ru,,,,,,
2,http://rck-vlg.ru,,,,,,
3,http://rtsk-vlg.ru,,,,,,
4,http://intellect-foundation.ru,,https://vk.com/intellect.foundation,https://instagram.com/intellect.foundation,,,


In [13]:
all_socials[~all_socials["vk"].isnull()]

Unnamed: 0,url,fb,vk,ig,ok,youtube,tiktok
4,http://intellect-foundation.ru,,https://vk.com/intellect.foundation,https://instagram.com/intellect.foundation,,,
12,http://рбоонадежда47регион.рф,,https://vk.com/pomoschsemie.luga?w=app5619682_...,,,,
14,http://securitymedia.ru,https://www.facebook.com/industriabezopasnosti,https://vk.com/securitymedia,,,,
18,http://ecology-tatarstan.ru,,https://vk.com/public173749519,,,,
22,http://dvbfond.ru,,https://vk.com/fonddobrovoblago,https://www.instagram.com/kirillpetrovhelp_sma/,,https://www.youtube.com/channel/UCiwOoMTRTDMa2...,
...,...,...,...,...,...,...,...
12274,http://arpko.ru,http://www.facebook.com/nprpko,http://vk.com/arpko,,,,
12279,http://icrt-russia.ru,https://www.facebook.com/sot.icrt.russia.cis/,https://vk.com/icrtrussia,https://www.instagram.com/icrtrussia/,,https://www.youtube.com/channel/UCOvJ2-PBIKO5A...,
12280,http://moscowfilmfestival.ru,,https://vk.com/mmkf,,,,
12287,http://kcmol.ru,,https://vk.com/kcmkursk,https://www.instagram.com/_evolution_dance_/,,,


The Gryadka dataset `2021_dec_social_networks_check.csv` contains more VK links than `vk_links.csv` (4569 vs. 4522).

## VK API minimal example

The access token is valid for only 24 hours; see https://dev.vk.com/api/getting-started

In [2]:
access_token = "vk1.a.z4MWINFBe0oM30zxrv8CPcLFDZcyExEgEM2bY9fJOUS2gtrYdjczK6JM2Zli5HAzLv0XRG_Od8Etw1PzvEYrFbgG4PPMGtgdSN02JSHWoh3cC_IIA1NP0QCYDbA8R8NR25VZWv5wfrXj1XW58iszIbm0QADCIr_m6jZI1qAwadOsSw38fhC4KCCC5H10k8Gf"

# proxy 144.76.241.45 port 7890 http
# access_token = "vk1.a.IeIpn6w6Ki9hM9HDW41gJSuNNrzEdgWZp6LzD0bz1w9OihY9DWzy2A_rpY1gW3YGRF0LEnIZlE-Z9JuKkhZrFmITmkNJdw1levUHe44udHbkUqE1zR3hER-scRvFs9A0GkHUO78fMAnTKRlPD85Z7I-5M9bcqtirmB3PClRcomgggcldRUwIq6k81IEOAqIP"

# expires_in=86400&user_id=26537712

In [3]:
import requests

response = requests.get("https://api.vk.com/method/friends.getOnline?v=5.131&access_token=" + access_token)
# proxies={"http": "http://144.76.241.45:7890", "https": "http://144.76.241.45:7890"})
response.text

'{"response":[7265079,7592964,14909720,18774990,22522985,26402925,38220308,38550321,43269671,47142125,52732921,55278550,66340837,75602894,84799526,86350202,91813902,102899215,118572031,122503981,167829096,177553912,190566081,191309594,224099941,244474708,265419885]}'
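
The request above is assembled by string concatenation and the raw text is returned unchecked. A small sketch of how the URL could be built from parameters and how an error reply could be detected is given below. The helper names `build_vk_url` and `unwrap_vk_response` are hypothetical, and the assumed error shape (`{"error": {"error_msg": ...}}`) is based on the `error_msg` replies noted later in this notebook.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://api.vk.com/method/"
API_VERSION = "5.131"  # the API version used throughout this notebook


def build_vk_url(method, token, **params):
    """Build a full request URL for one API method (hypothetical helper)."""
    query = {**params, "v": API_VERSION, "access_token": token}
    return API_ROOT + method + "?" + urlencode(query)


def unwrap_vk_response(text):
    """Return the `response` payload, or raise if the reply carries an error.

    Assumes errors arrive as {"error": {"error_msg": "..."}}.
    """
    data = json.loads(text)
    if "error" in data:
        raise RuntimeError(data["error"].get("error_msg", "unknown VK API error"))
    return data["response"]
```

With these, `build_vk_url("groups.getMembers", token, group_id="intellect.foundation")` produces the same kind of URL as the hand-written strings, and values with special characters get percent-encoded.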

## How to obtain groups data (Minimal example)

1. Number of members:

In [4]:
from random import choice



def send_request(request):
    # send request to VK api
    user_agent_list = [
                         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
                         'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
                         'Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0',
                         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
                         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
                         'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
                         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36',
                         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.25 (KHTML, like Gecko) Version/8.0 Safari/600.1.25',
                         'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36',
                         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/600.1.17 (KHTML, like Gecko) Version/7.1 Safari/537.85.10',
                         'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
                         'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0',
                        ]

    user_agent = choice(user_agent_list)
    user_agent_headers = {
        'User-Agent': user_agent
    }
    # proxies = ['http://78.47.16.54:80', 'http://203.75.190.21:80', 'http://77.72.3.163:80']
    # proxy = choice(proxies)
    response = requests.get(request + access_token,
                            headers=user_agent_headers,
                            timeout=3)
                            # proxies={"http": "http://144.76.241.45:7890", "https": "http://144.76.241.45:7890"})   #, proxies = {"http": proxy, "https": proxy})
    return response.text

In [9]:
vk_request = "https://api.vk.com/method/groups.getMembers?group_id=intellect.foundation&v=5.131&access_token="

send_request(vk_request)

'{"response":{"count":53,"items":[978510,1186608,4695292,5385373,6266011,6368304,13316957,23032683,31672175,44044409,59181088,66034160,106381670,135416887,176390692,208394706,223715795,229917761,235997690,240906543,242923279,263655933,266170494,287209588,290266258,310894347,321070784,367796163,381146828,401464040,425154997,445264090,445299987,490797121,512299272,552660702,584452216,588078019,616103852,618947324,621827122,625500385,629306409,639300133,645511319,645939370,647270929,650114051,654144713,655520219,655981830,657557589,675664434]}}'

2. How often the group posts (posts per month, for starters), using `wall.get`:

In [20]:
from pprint import pprint


vk_request = "https://api.vk.com/method/wall.get?domain=pomoschsemie.luga&v=5.131&access_token="

pprint(send_request(vk_request))

ProxyError: HTTPSConnectionPool(host='api.vk.com', port=443): Max retries exceeded with url: /method/wall.get?domain=pomoschsemie.luga&v=5.131&access_token=vk1.a.IeIpn6w6Ki9hM9HDW41gJSuNNrzEdgWZp6LzD0bz1w9OihY9DWzy2A_rpY1gW3YGRF0LEnIZlE-Z9JuKkhZrFmITmkNJdw1levUHe44udHbkUqE1zR3hER-scRvFs9A0GkHUO78fMAnTKRlPD85Z7I-5M9bcqtirmB3PClRcomgggcldRUwIq6k81IEOAqIP (Caused by ProxyError('Cannot connect to proxy.', OSError(0, 'Error')))

In [65]:
res = send_request(vk_request)

In [66]:
import json
response = json.loads(res)

In [68]:
for item in response["response"]["items"]:
    if item["post_type"] != "post":
        pprint(item)

[Output truncated: pretty-printed `wall.get` items. The items shown are reposts: each has `comments` counters and a `copy_history` list whose entries carry photo or link `attachments` (with `sizes`, `url`, `width`, `height`), plus `date`, `from_id`, `id`, `owner_id`, `post_source`, `post_type`, and `text`.]

In [47]:
# TEST: check how to extract month
from datetime import datetime
from collections import defaultdict

timestamp = 1630510421
dt_object = datetime.fromtimestamp(timestamp)

print("dt_object =", dt_object.strftime("%m"))  # "%Y"

months = defaultdict(list)
for item in response["response"]["items"]:
    timestamp = item["date"]
    dt_object = datetime.fromtimestamp(timestamp)
    year = dt_object.strftime("%Y")
    month = dt_object.strftime("%m")
    months[str(year) + str(month)].append(item["id"])

print(months, response["response"]["count"])
posts_per_month = sum([len(val) for val in months.values()]) / len(months)
print(posts_per_month)

dt_object = 09
defaultdict(<class 'list'>, {'202109': [10], '202108': [9, 8, 5, 4], '202106': [3]}) 6
2.0
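
Note that the calculation above divides by the number of months that contain at least one post, so months with no posts are ignored (here June, August and September give 6 / 3 = 2.0). The sketch below shows both that behaviour and a variant that also counts the empty months in between. The function name `posts_per_month` and its `include_empty_months` flag are hypothetical, and timestamps are interpreted as UTC here, whereas the cell above uses local time.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_month(timestamps, include_empty_months=False):
    """Average number of posts per calendar month for a list of Unix timestamps."""
    if not timestamps:
        return None
    months = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        months[(dt.year, dt.month)] += 1
    if include_empty_months:
        # Count every calendar month from the first post to the last one.
        (y0, m0), (y1, m1) = min(months), max(months)
        n_months = (y1 - y0) * 12 + (m1 - m0) + 1
    else:
        # Same as the notebook cell: only months that have posts.
        n_months = len(months)
    return sum(months.values()) / n_months
```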


3. Number of likes on each post (then take the mean), and

4. Number of comments on each post (then take the mean), `wall.getComments`

Likes are already included in the `wall.get` response above: for each `item` in `response["response"]["items"]`, the count is `item["likes"]["count"]`. The same item also gives the number of comments (`item["comments"]["count"]`), the number of reposts (`item["reposts"]["count"]`), and the total number of views of the post (`item["views"]["count"]`).
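
As a small illustration of reading these counters, the sketch below pulls them out of a hand-made response. The `sample` data is invented for the example (it is not real API output), and `post_metrics` is a hypothetical helper name.

```python
def post_metrics(item):
    """Collect the per-post counters of one wall item; absent ones become None."""
    return {
        name: item.get(name, {}).get("count")
        for name in ("likes", "comments", "reposts", "views")
    }


# Invented, heavily trimmed stand-in for a parsed wall.get response.
sample = {"response": {"count": 2, "items": [
    {"id": 10, "likes": {"count": 4}, "comments": {"count": 0},
     "reposts": {"count": 1}, "views": {"count": 120}},
    {"id": 9, "likes": {"count": 2}, "comments": {"count": 1},
     "reposts": {"count": 0}},  # no "views" key on this post
]}}

metrics = [post_metrics(item) for item in sample["response"]["items"]]
mean_likes = sum(m["likes"] for m in metrics) / len(metrics)
```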

### Check links

In [39]:
ngos_with_vk = all_socials[~all_socials["vk"].isnull()]

In [40]:
ngos_with_vk.head()

Unnamed: 0,url,fb,vk,ig,ok,youtube,tiktok
4,http://intellect-foundation.ru,,https://vk.com/intellect.foundation,https://instagram.com/intellect.foundation,,,
12,http://рбоонадежда47регион.рф,,https://vk.com/pomoschsemie.luga?w=app5619682_...,,,,
14,http://securitymedia.ru,https://www.facebook.com/industriabezopasnosti,https://vk.com/securitymedia,,,,
18,http://ecology-tatarstan.ru,,https://vk.com/public173749519,,,,
22,http://dvbfond.ru,,https://vk.com/fonddobrovoblago,https://www.instagram.com/kirillpetrovhelp_sma/,,https://www.youtube.com/channel/UCiwOoMTRTDMa2...,


In [41]:
re.search(r'(.+)\?', ngos_with_vk.iloc[1]["vk"]).group(1)

'https://vk.com/pomoschsemie.luga'

In [42]:
ngos_with_vk["vk"] = ngos_with_vk["vk"].apply(lambda x: re.search(r'(.+)\?', x).group(1) if "?" in x else x)    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

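The `SettingWithCopyWarning` above appears because `ngos_with_vk` was created by filtering `all_socials`, so pandas cannot tell whether the assignment is meant to affect the original table. One common way to avoid it is to take an explicit copy before modifying the column. The sketch below uses a made-up three-row table in place of the real data.

```python
import pandas as pd

# Made-up stand-in for the real `all_socials` table.
all_socials = pd.DataFrame({
    "url": ["http://a.example", "http://b.example", "http://c.example"],
    "vk": [None, "https://vk.com/example_group?w=wall-1_2", "https://vk.com/securitymedia"],
})

# .copy() makes the filtered frame independent of `all_socials`.
ngos_with_vk = all_socials[all_socials["vk"].notna()].copy()

# Keep only the part before the first "?".
ngos_with_vk["vk"] = ngos_with_vk["vk"].apply(lambda x: x.split("?", 1)[0])
```

Unlike the regular expression used in the notebook, which is greedy and cuts at the last `?`, this variant cuts at the first `?`; for links with a single query string the result is the same.
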

In [43]:
ngos_with_vk.head(50)

Unnamed: 0,url,fb,vk,ig,ok,youtube,tiktok
4,http://intellect-foundation.ru,,https://vk.com/intellect.foundation,https://instagram.com/intellect.foundation,,,
12,http://рбоонадежда47регион.рф,,https://vk.com/pomoschsemie.luga,,,,
14,http://securitymedia.ru,https://www.facebook.com/industriabezopasnosti,https://vk.com/securitymedia,,,,
18,http://ecology-tatarstan.ru,,https://vk.com/public173749519,,,,
22,http://dvbfond.ru,,https://vk.com/fonddobrovoblago,https://www.instagram.com/kirillpetrovhelp_sma/,,https://www.youtube.com/channel/UCiwOoMTRTDMa2...,
24,http://адвокат-юрист-красногорск.рф,,https://vk.com/mouc_krasnogorsk,,,,
29,http://nf2217.ru,https://www.facebook.com/МРОО-помощи-пациентам...,https://vk.com/public194953135,https://www.instagram.com/nf_2217/,,https://www.youtube.com/channel/UCp693gH6frXdB...,
32,http://nko-pazvitiye.ru,,https://vk.com/nkorazvitie64,https://www.instagram.com/nko_razvitie64/,,,
33,http://polzamarket.ru,https://www.facebook.com/u.bioproduct/,https://vk.com/polzamarket_tmn,https://www.instagram.com/polzamarket.rf/,,https://www.youtube.com/channel/UChUFd04mbv3EA...,
35,http://добрубыть.рф,,https://vk.com/dobrubit72,https://www.instagram.com/dobrubit/,https://ok.ru/group/57613371506704,,


Some links point to the bare `vk.com` domain rather than to a specific group page. We filter those out:

In [44]:
ngos_with_vk = ngos_with_vk[~ngos_with_vk["vk"].str.endswith(".com/")]

In [45]:
ngos_with_vk.head(30)

Unnamed: 0,url,fb,vk,ig,ok,youtube,tiktok
4,http://intellect-foundation.ru,,https://vk.com/intellect.foundation,https://instagram.com/intellect.foundation,,,
12,http://рбоонадежда47регион.рф,,https://vk.com/pomoschsemie.luga,,,,
14,http://securitymedia.ru,https://www.facebook.com/industriabezopasnosti,https://vk.com/securitymedia,,,,
18,http://ecology-tatarstan.ru,,https://vk.com/public173749519,,,,
22,http://dvbfond.ru,,https://vk.com/fonddobrovoblago,https://www.instagram.com/kirillpetrovhelp_sma/,,https://www.youtube.com/channel/UCiwOoMTRTDMa2...,
24,http://адвокат-юрист-красногорск.рф,,https://vk.com/mouc_krasnogorsk,,,,
29,http://nf2217.ru,https://www.facebook.com/МРОО-помощи-пациентам...,https://vk.com/public194953135,https://www.instagram.com/nf_2217/,,https://www.youtube.com/channel/UCp693gH6frXdB...,
32,http://nko-pazvitiye.ru,,https://vk.com/nkorazvitie64,https://www.instagram.com/nko_razvitie64/,,,
33,http://polzamarket.ru,https://www.facebook.com/u.bioproduct/,https://vk.com/polzamarket_tmn,https://www.instagram.com/polzamarket.rf/,,https://www.youtube.com/channel/UChUFd04mbv3EA...,
35,http://добрубыть.рф,,https://vk.com/dobrubit72,https://www.instagram.com/dobrubit/,https://ok.ru/group/57613371506704,,


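Some links are on other hosts such as `oauth.vk.com`, but we need only the main VK domain, so we keep only links that start with `https://vk.com/`. Note that this also drops links written as `http://vk.com/...` or `//vk.com/...`: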

In [46]:
ngos_with_vk = ngos_with_vk[ngos_with_vk["vk"].str.startswith("https://vk.com/")]

In [47]:
ngos_with_vk.head(30)

Unnamed: 0,url,fb,vk,ig,ok,youtube,tiktok
4,http://intellect-foundation.ru,,https://vk.com/intellect.foundation,https://instagram.com/intellect.foundation,,,
12,http://рбоонадежда47регион.рф,,https://vk.com/pomoschsemie.luga,,,,
14,http://securitymedia.ru,https://www.facebook.com/industriabezopasnosti,https://vk.com/securitymedia,,,,
18,http://ecology-tatarstan.ru,,https://vk.com/public173749519,,,,
22,http://dvbfond.ru,,https://vk.com/fonddobrovoblago,https://www.instagram.com/kirillpetrovhelp_sma/,,https://www.youtube.com/channel/UCiwOoMTRTDMa2...,
24,http://адвокат-юрист-красногорск.рф,,https://vk.com/mouc_krasnogorsk,,,,
29,http://nf2217.ru,https://www.facebook.com/МРОО-помощи-пациентам...,https://vk.com/public194953135,https://www.instagram.com/nf_2217/,,https://www.youtube.com/channel/UCp693gH6frXdB...,
32,http://nko-pazvitiye.ru,,https://vk.com/nkorazvitie64,https://www.instagram.com/nko_razvitie64/,,,
33,http://polzamarket.ru,https://www.facebook.com/u.bioproduct/,https://vk.com/polzamarket_tmn,https://www.instagram.com/polzamarket.rf/,,https://www.youtube.com/channel/UChUFd04mbv3EA...,
35,http://добрубыть.рф,,https://vk.com/dobrubit72,https://www.instagram.com/dobrubit/,https://ok.ru/group/57613371506704,,


Drop `share.php` links, which are share buttons rather than group pages:

In [54]:
ngos_with_vk = ngos_with_vk[~ngos_with_vk["vk"].str.startswith("https://vk.com/share.php")]

In [55]:
ngos_with_vk.shape[0]

3765
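
The filtering steps above can be gathered into one plain function, sketched below. The name `clean_vk_link` is hypothetical. Unlike the notebook cells, this version first strips whitespace and rewrites `//vk.com/...` and `http://vk.com/...` to `https://`, so such links are kept rather than dropped; that is a deliberate difference and would change the counts reported here.

```python
def clean_vk_link(link):
    """Return a cleaned https://vk.com/... link, or None if it should be dropped."""
    link = link.strip()
    # Normalise the scheme (not done in the notebook cells above).
    if link.startswith("//"):
        link = "https:" + link
    elif link.startswith("http://"):
        link = "https://" + link[len("http://"):]
    # Drop the query string.
    link = link.split("?", 1)[0]
    # Keep only the main VK domain (this rejects e.g. oauth.vk.com).
    if not link.startswith("https://vk.com/"):
        return None
    # Bare domain or share button: not a group page.
    if link == "https://vk.com/" or link.startswith("https://vk.com/share.php"):
        return None
    return link
```

On a DataFrame it could be applied as `df["vk"].apply(clean_vk_link)` followed by dropping the rows where the result is missing.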

Save the file:

In [48]:
ngos_with_vk.to_csv("../data/2021_dec_social_networks_vk_clean.csv", encoding="utf-8")

In [None]:
# For next runs
# ngos_with_vk = pd.read_csv("../data/2021_dec_social_networks_vk_clean.csv", encoding="utf-8")
# ngos_with_vk.head()

In [59]:
ngo_vks = ngo_vks[~ngo_vks["vk"].isnull()]

In [60]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: re.search(r'(.+)\?', x).group(1) if "?" in x else x)    

In [61]:
ngo_vks = ngo_vks[~ngo_vks["vk"].str.endswith(".com/")]

In [62]:
ngo_vks = ngo_vks[ngo_vks["vk"].str.startswith("https://vk.com/")]

In [63]:
ngo_vks = ngo_vks[~ngo_vks["vk"].str.startswith("https://vk.com/share.php")]

In [64]:
ngo_vks.shape[0]

3871

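In the end, `vk_links.csv` yields more valid links (3871 vs. 3765) even though it is the shorter file, so I am going to use this dataset. Let us save it (this overwrites the file saved above, since the path is the same):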

In [65]:
ngo_vks.to_csv("../data/2021_dec_social_networks_vk_clean.csv", encoding="utf-8")

In [129]:
ngo_vks = pd.read_csv("../data/2021_dec_social_networks_vk_clean.csv", encoding="utf-8")
ngo_vks.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,vk
0,0,633,https://vk.com/ratris
1,1,634,https://vk.com//public203415983
2,2,635,https://vk.com//public203659500
3,3,636,https://vk.com/15days_victory
4,4,637,https://vk.com/18blagodar
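
The `Unnamed: 0`, `Unnamed: 0.1` and `Unnamed: 0.2` columns in the output above are old row indexes: each `to_csv` call writes the index as an unnamed column, and each `read_csv` call reads it back as ordinary data. A minimal sketch of avoiding this is shown below, using an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

links = pd.DataFrame({"vk": ["https://vk.com/ratris", "https://vk.com/18blagodar"]})

buffer = io.StringIO()             # stands in for the CSV file on disk
links.to_csv(buffer, index=False)  # do not write the row index
buffer.seek(0)

reloaded = pd.read_csv(buffer)     # only the "vk" column comes back
```

For files that were already written with an index, `pd.read_csv(path, index_col=0)` reads the first column back as the index instead of as data.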


More problematic link types, with the group page each one should be mapped to:

- https://vk.com/wall-79408209 (the whole wall of a public page) -> https://vk.com/public79408209
- https://vk.com/wall-111030342_8435 (a specific post on the wall of a public page) -> https://vk.com/public111030342
- https://vk.com/topic-88758316_31319416 (a discussion topic in a public page we are interested in) -> https://vk.com/public88758316
- https://vk.com/@ratris (a site page, not the main VK public page) -> https://vk.com/ratris
- https://vk.com/prodlenkaplushttps://vk.com/prodlenkaplus (the same link pasted twice) -> https://vk.com/prodlenkaplus
- links with a trailing `/` -> strip it
- https://vk.com/album-104692916_282960157 (an album) -> https://vk.com/public104692916
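
The rewrites in this list can also be expressed as a single function, sketched below. The name `normalize_vk_link` is hypothetical, and the sketch covers only the cases listed above (wall, topic and album links, `@` pages, doubled links, trailing slashes).

```python
import re

# /<kind>-<group id> optionally followed by _<object id>
_OBJECT_LINK = re.compile(r"^https://vk\.com/(?:wall|topic|album)-(\d+)(?:_\d+)?$")


def normalize_vk_link(link):
    """Map a link to an object inside a community onto the community page."""
    link = link.strip().rstrip("/")
    # The same URL pasted twice in a row: keep the first copy.
    parts = link.split("https://")
    if len(parts) > 2:
        link = "https://" + parts[1]
    # /@name is a site page; the community itself is /name.
    link = link.replace("/@", "/", 1)
    match = _OBJECT_LINK.match(link)
    if match:
        return "https://vk.com/public" + match.group(1)
    return link
```

Keeping all rules in one place makes it easier to test them against the examples from the list before applying the function to the whole column.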

In [130]:
link = "https://vk.com/wall-79408209"

if re.match(r'.*?/wall-\d+$', link):
    print(link.replace('wall-', 'public'))

https://vk.com/public79408209


In [131]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.replace('wall-', 'public') if re.match(r'.*?/wall-\d+$', x) else x)

In [102]:
link = "https://vk.com/wall-111030342_435"

if re.match(r'.*?/wall-\d+?_\d+$', link):
    link = link.split('_')[0]
    print(link.replace('wall-', 'public'))

https://vk.com/public111030342


In [132]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.split('_')[0].replace('wall-', 'public') if re.match(r'.*?/wall-\d+?_\d+$', x) else x)

In [105]:
link = "https://vk.com/topic-88758316_31319416"

if re.match(r'.*?/topic-\d+?_\d+$', link):
    link = link.split('_')[0]
    print(link.replace('topic-', 'public'))

https://vk.com/public88758316


In [133]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.split('_')[0].replace('topic-', 'public') if re.match(r'.*?/topic-\d+?_\d+$', x) else x)

In [123]:
link = "https://vk.com/album-104692916_282960157"

if re.match(r'.*?/album-\d+?_\d+$', link):
    link = link.split('_')[0]
    print(link.replace('album-', 'public'))

https://vk.com/public104692916


In [134]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.split('_')[0].replace('album-', 'public') if re.match(r'.*?/album-\d+?_\d+$', x) else x)

In [127]:
link = "https://vk.com/albums-149498770"

if re.match(r'.*?/albums-\d+$', link):
    print(link.replace('albums-', 'public'))

https://vk.com/public149498770


In [135]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.replace('albums-', 'public') if re.match(r'.*?/albums-\d+$', x) else x)

In [90]:
link = "https://vk.com/@ratris"

if "/@" in link:
    print(link.replace('@', ''))

https://vk.com/ratris


In [136]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.replace('@', '') if "/@" in x else x)

In [108]:
link = "https://vk.com/prodlenkaplushttps://vk.com/prodlenkaplus"
dubles_test = link.split('https://')
if len(dubles_test) > 2:
    print("https://" + dubles_test[1])

https://vk.com/prodlenkaplus


In [137]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: "https://" + x.split('https://')[1] if len(x.split('https://')) > 2 else x)

In [139]:
link = "https://vk.com/album220163822_240663742"

if re.match(r'.*?/album\d+?_\d+$', link):
    link = link.split('_')[0]
    print(link.replace('album', 'public'))

https://vk.com/public220163822


In [140]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.split('_')[0].replace('album', 'public') if re.match(r'.*?/album\d+?_\d+$', x) else x)

In [148]:
link = "https://vk.com/app5619682_-37666750#502331"

if re.match(r'.*?/app\d+?_-\d+?#\d+$', link):
    m = re.search(r'-(\d+?)#', link)
    print("https://vk.com/public" + m.group(1))

https://vk.com/public37666750


In [149]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: "https://vk.com/public" + re.search(r'-(\d+?)#', x).group(1) if re.match(r'.*?/app\d+?_-\d+?#\d+$', x) else x)

In [152]:
link = "https://vk.com/app5708398_-142199366"

if re.match(r'.*?/app\d+?_-\d+$', link):
    m = re.search(r'-(\d+)', link)
    print("https://vk.com/public" + m.group(1))

https://vk.com/public142199366


In [153]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: "https://vk.com/public" + re.search(r'-(\d+)', x).group(1) if re.match(r'.*?/app\d+?_-\d+$', x) else x)

In [155]:
link = "https://vk.com/doc-134882220_448563071"

if re.match(r'.*?/doc-\d+?_\d+$', link):
    link = link.split('_')[0]
    print(link.replace('doc-', 'public'))

https://vk.com/public134882220


In [156]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.split('_')[0].replace('doc-', 'public') if re.match(r'.*?/doc-\d+?_\d+$', x) else x)

In [158]:
link = "https://vk.com/doc345920167_631326625" # this link is just broken

if re.match(r'.*?/doc\d+?_\d+$', link):
    link = link.split('_')[0]
    print(link.replace('doc', 'public'))

https://vk.com/public345920167


In [159]:
ngo_vks.drop(ngo_vks[ngo_vks["vk"] == "https://vk.com/doc345920167_631326625"].index, inplace=True)

In [162]:
ngo_vks.drop(ngo_vks[ngo_vks["vk"] == "https://vk.com/im"].index, inplace=True)

In [163]:
link = "https://vk.com/docs-125680016"

if re.match(r'.*?/docs-\d+$', link):
    print(link.replace('docs-', 'public'))

https://vk.com/public125680016


In [164]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.replace('docs-', 'public') if re.match(r'.*?/docs-\d+$', x) else x)

In [167]:
ngo_vks.drop(ngo_vks[ngo_vks["vk"] == "https://vk.com/images/blank.gif"].index, inplace=True)

In [170]:
link = "https://vk.com/podcast-3424373_456240564"

if re.match(r'.*?/podcast-\d+?_\d+$', link):
    link = link.split('_')[0]
    print(link.replace('podcast-', 'public'))

https://vk.com/public3424373


In [171]:
ngo_vks["vk"] = ngo_vks["vk"].apply(lambda x: x.split('_')[0].replace('podcast-', 'public') if re.match(r'.*?/podcast-\d+?_\d+$', x) else x)

In [174]:
ngo_vks.to_csv("../data/2021_dec_social_networks_vk_clean.csv", encoding="utf-8")

In [10]:
# on the next run just read
ngo_vks = pd.read_csv("../data/2021_dec_social_networks_vk_clean.csv", encoding="utf-8")

## Test loop

In [8]:
ngo_vks.shape[0]

3864

In [9]:
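def get_count(method, group_id):
    """Call a VK API method for one group and return the parsed JSON response.

    `method` is the method name plus the start of the query string,
    e.g. "groups.getMembers?group_id=" or "wall.get?domain=".
    """
    # Links like vk.com/public123 carry a numeric id, so pass only the number.
    # fullmatch avoids mangling screen names that merely contain "public".
    numeric = re.fullmatch(r'public(\d+)', group_id)
    if numeric:
        group_id = numeric.group(1)
    vk_request = "https://api.vk.com/method/" + method + group_id + "&v=5.131&access_token="
    # Errors seen so far:
    # "error_msg":"Access denied: group hide members"
    # "error_msg":"Invalid group id"
    response = json.loads(send_request(vk_request))
    return response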


def get_posts_per_month(response_wall):
    if "response" in response_wall:
        months = defaultdict(list)
        for item in response_wall["response"]["items"]:
            timestamp = item["date"]
            dt_object = datetime.fromtimestamp(timestamp)
            year = dt_object.strftime("%Y")
            month = dt_object.strftime("%m")
            months[str(year) + str(month)].append(item["id"])
        if len(months) > 0:
            posts_per_month = sum([len(val) for val in months.values()]) / len(months)
            return round(posts_per_month, 2)
        else:
            return None
    else:
        return None


def get_count_and_mean(response_wall: dict, feature: str) -> tuple:
    if "response" in response_wall:
        counts = []
        for item in response_wall["response"]["items"]:
            if feature not in item:
                return None, None
            count = item[feature]["count"]
            counts.append(count)
        if len(counts) > 0:
            mean_val = sum(counts) / len(counts)
            return sum(counts), round(mean_val, 2)
        else:
            return None, None
    else:
        return None, None
    
    
def get_binary_param(response_wall: dict, feature: str) -> tuple:
    if "response" in response_wall:
        ads = []
        for item in response_wall["response"]["items"]:
            if feature not in item:
                return None, None
            ad_or_not = item[feature]
            ads.append(ad_or_not)
        if len(ads) > 0:
            mean_val = sum(ads) / len(ads)
            return sum(ads), round(mean_val, 2)
        else:
            return None, None
    else:
        return None, None
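
To make the per-month averaging concrete, here is a minimal worked example on a mock payload. The function name `posts_per_month_from_items`, the `ts` helper and the mock data are hypothetical and only illustrate the same bucketing idea; unlike `get_posts_per_month`, this sketch uses UTC so that the result does not depend on the local time zone.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_month_from_items(items):
    """Average posts per calendar month, over months that contain posts."""
    # Bucket every post by a "YYYYMM" key (UTC is an assumption of this sketch).
    per_month = Counter(
        datetime.fromtimestamp(item["date"], tz=timezone.utc).strftime("%Y%m")
        for item in items
    )
    if not per_month:
        return None
    return round(sum(per_month.values()) / len(per_month), 2)


def ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Hypothetical wall.get-style payload: three posts in January, one in February.
mock_wall = {
    "response": {
        "count": 4,
        "items": [
            {"id": 1, "date": ts(2022, 1, 5)},
            {"id": 2, "date": ts(2022, 1, 12)},
            {"id": 3, "date": ts(2022, 1, 28)},
            {"id": 4, "date": ts(2022, 2, 10)},
        ],
    }
}

mean_posts = posts_per_month_from_items(mock_wall["response"]["items"])
print(mean_posts)  # 2.0
```

Months without any fetched post are not counted in the denominator, so a group that posts in bursts with long gaps can look more active than it is.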

In [33]:
a = list("abc")
a[1:2]

['b']

In [10]:
from time import sleep
from random import randint
import json
from datetime import datetime
from collections import defaultdict


count_members = []
count_posts_all = []
count_posts_per_month = []
likes_overall_all, likes_mean_all = [], []
comments_overall_all, comments_mean_all = [], []
reposts_overall_all, reposts_mean_all = [], []
views_overall_all, views_mean_all = [], []
ads_overall_all, ads_mean_all = [], []

for entry in ngo_vks.iloc[3000:4000].itertuples():
    group_id = re.sub('https://vk.com/', '', entry.vk).strip('/')
    
    # get members count
    members_method = "groups.getMembers?group_id="
    response = get_count(members_method, group_id)
    if "response" in response:
        count_members.append(response["response"]["count"])
    else:
        count_members.append(None)
    
    # get posts overall count
    posts_method = "wall.get?domain="
    response_wall = get_count(posts_method, group_id)
    if "response" in response_wall:
        count_posts_all.append(response_wall["response"]["count"])
    else:
        count_posts_all.append(None)
    
    # get posts per month
    posts_per_month = get_posts_per_month(response_wall)
    count_posts_per_month.append(posts_per_month)
    
    # get likes overall count & get mean likes per post
    likes_overall, likes_mean = get_count_and_mean(response_wall, "likes")
    # get comments overall count & get comments per post
    comments_overall, comments_mean = get_count_and_mean(response_wall, "comments")
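    # get reposts overall count & get reposts per post
    # (the outputs saved below were produced with "likes" passed here, so their reposts_* columns mirror likes_*)
    reposts_overall, reposts_mean = get_count_and_mean(response_wall, "reposts")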
    # get views overall count & get views per post
    views_overall, views_mean = get_count_and_mean(response_wall, "views")
    
    # 'marked_as_ads' is a binary flag on each post; it seems worth tracking too
    ads_overall, ads_mean = get_binary_param(response_wall, "marked_as_ads")
    
    # append all parameters that are left to make a df column
    likes_overall_all.append(likes_overall)
    likes_mean_all.append(likes_mean)
    comments_overall_all.append(comments_overall)
    comments_mean_all.append(comments_mean)
    reposts_overall_all.append(reposts_overall)
    reposts_mean_all.append(reposts_mean)
    views_overall_all.append(views_overall)
    views_mean_all.append(views_mean)
    ads_overall_all.append(ads_overall)
    ads_mean_all.append(ads_mean)
    
    sleep(randint(0,2))
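
The loop above keeps thirteen parallel lists in step by hand. One possible alternative, sketched here under assumptions, is to build a single dict per group and append it to one list, so the columns cannot drift out of alignment. `count_and_mean` and `summarize_group` are hypothetical names, and the two mock responses stand in for real API replies.

```python
def count_and_mean(items, feature):
    """Sum and mean of item[feature]["count"]; (None, None) if any item lacks it."""
    if not items or any(feature not in item for item in items):
        return None, None
    counts = [item[feature]["count"] for item in items]
    return sum(counts), round(sum(counts) / len(counts), 2)


def summarize_group(group_id, response_members, response_wall):
    """Build one flat row for a group from the members and wall responses."""
    row = {"group_id": group_id, "count_members": None, "count_posts_all": None}
    if "response" in response_members:
        row["count_members"] = response_members["response"]["count"]
    items = []
    if "response" in response_wall:
        row["count_posts_all"] = response_wall["response"]["count"]
        items = response_wall["response"]["items"]
    for feature in ("likes", "comments", "reposts", "views"):
        total, mean = count_and_mean(items, feature)
        row[f"{feature}_overall_all"] = total
        row[f"{feature}_mean_all"] = mean
    return row


# Mock replies: hidden member list, and a wall with two fetched posts.
members_resp = {"error": {"error_msg": "Access denied: group hide members"}}
wall_resp = {
    "response": {
        "count": 2,
        "items": [
            {"id": 1, "likes": {"count": 10}, "comments": {"count": 1},
             "reposts": {"count": 2}, "views": {"count": 100}},
            {"id": 2, "likes": {"count": 30}, "comments": {"count": 3},
             "reposts": {"count": 4}},  # no "views" on this post
        ],
    }
}

rows = []
row = summarize_group("example_group", members_resp, wall_resp)
rows.append(row)
print(row["likes_mean_all"], row["reposts_overall_all"], row["views_mean_all"])
# 20.0 6 None
```

A list of such dicts can be passed directly to `pd.DataFrame(rows)`, which would make the manual padding with `[None] * ...` unnecessary.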

In [11]:
len(count_members)

864

In [12]:
# sanity check: every list should have the same length
print(len(views_overall_all), len(count_posts_all), len(likes_overall_all), len(likes_mean_all), len(comments_overall_all), len(comments_mean_all), len(reposts_overall_all), len(reposts_mean_all))

864 864 864 864 864 864 864 864


In [23]:
# create columns
# first run

ngo_vks["count_members"] = count_members + [None] * (len(ngo_vks)-393-1)  # !!!! uneven: count_members is one element longer than the other lists
ngo_vks["count_posts_all"] = count_posts_all + [None] * (len(ngo_vks)-393)
ngo_vks["count_posts_per_month"] = count_posts_per_month + [None] * (len(ngo_vks)-393)
ngo_vks["likes_overall_all"] = likes_overall_all + [None] * (len(ngo_vks)-393)
ngo_vks["likes_mean_all"] = likes_mean_all + [None] * (len(ngo_vks)-393)
ngo_vks["comments_overall_all"] = comments_overall_all + [None] * (len(ngo_vks)-393)
ngo_vks["comments_mean_all"] = comments_mean_all + [None] * (len(ngo_vks)-393)
ngo_vks["reposts_overall_all"] = reposts_overall_all + [None] * (len(ngo_vks)-393)
ngo_vks["reposts_mean_all"] = reposts_mean_all + [None] * (len(ngo_vks)-393)
ngo_vks["views_overall_all"] = views_overall_all + [None] * (len(ngo_vks)-393)
ngo_vks["views_mean_all"] = views_mean_all + [None] * (len(ngo_vks)-393)
ngo_vks["ads_overall_all"] = ads_overall_all + [None] * (len(ngo_vks)-393)
ngo_vks["ads_mean_all"] = ads_mean_all + [None] * (len(ngo_vks)-393)

In [14]:
# next runs are in batches
k = 3000  # offset of this batch within ngo_vks
for i in range(3000, 3864):
    ngo_vks.at[i, "count_members"] = count_members[i-k]
    ngo_vks.at[i, "count_posts_all"] = count_posts_all[i-k]
    ngo_vks.at[i, "count_posts_per_month"] = count_posts_per_month[i-k]
    ngo_vks.at[i, "likes_overall_all"] = likes_overall_all[i-k]
    ngo_vks.at[i, "likes_mean_all"] = likes_mean_all[i-k]
    ngo_vks.at[i, "comments_overall_all"] = comments_overall_all[i-k]
    ngo_vks.at[i, "comments_mean_all"] = comments_mean_all[i-k]
    ngo_vks.at[i, "reposts_overall_all"] = reposts_overall_all[i-k]
    ngo_vks.at[i, "reposts_mean_all"] = reposts_mean_all[i-k]
    ngo_vks.at[i, "views_overall_all"] = views_overall_all[i-k]
    ngo_vks.at[i, "views_mean_all"] = views_mean_all[i-k]
    ngo_vks.at[i, "ads_overall_all"] = ads_overall_all[i-k]
    ngo_vks.at[i, "ads_mean_all"] = ads_mean_all[i-k]

In [17]:
ngo_vks.iloc[3861:3880]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,vk,count_members,count_posts_all,count_posts_per_month,likes_overall_all,likes_mean_all,comments_overall_all,comments_mean_all,reposts_overall_all,reposts_mean_all,views_overall_all,views_mean_all,ads_overall_all,ads_mean_all
3861,3861,3868,3868,4501,https://vk.com/zrp56,294.0,574.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,178.0,8.9,0.0,0.0
3862,3862,3869,3869,4502,https://vk.com/zukk174,,15882.0,10.0,912.0,45.6,35.0,1.75,912.0,45.6,86178.0,4308.9,0.0,0.0
3863,3863,3870,3870,4503,https://vk.com/zvezda_2005,6965.0,6187.0,10.0,2159.0,107.95,150.0,7.5,2159.0,107.95,55144.0,2757.2,0.0,0.0


In [18]:
ngo_vks.tail()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,vk,count_members,count_posts_all,count_posts_per_month,likes_overall_all,likes_mean_all,comments_overall_all,comments_mean_all,reposts_overall_all,reposts_mean_all,views_overall_all,views_mean_all,ads_overall_all,ads_mean_all
3859,3859,3866,3866,4499,https://vk.com/zoolimpopo,45227.0,2271.0,10.0,7245.0,362.25,232.0,11.6,7245.0,362.25,459325.0,22966.25,0.0,0.0
3860,3860,3867,3867,4500,https://vk.com/zpokolenie33,540.0,1857.0,10.0,256.0,12.8,1.0,0.05,256.0,12.8,6890.0,344.5,0.0,0.0
3861,3861,3868,3868,4501,https://vk.com/zrp56,294.0,574.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,178.0,8.9,0.0,0.0
3862,3862,3869,3869,4502,https://vk.com/zukk174,,15882.0,10.0,912.0,45.6,35.0,1.75,912.0,45.6,86178.0,4308.9,0.0,0.0
3863,3863,3870,3870,4503,https://vk.com/zvezda_2005,6965.0,6187.0,10.0,2159.0,107.95,150.0,7.5,2159.0,107.95,55144.0,2757.2,0.0,0.0


In [33]:
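# save data (index=False keeps the row index from being written as another "Unnamed" column)
ngo_vks.to_csv("../data/2021_dec_vk_parsed.csv", encoding="utf-8", index=False)

In [5]:
# read the data back to continue filling it in
ngo_vks = pd.read_csv("../data/2021_dec_vk_parsed.csv", encoding="utf-8")

## Short analytics

In [31]:
# drop the leftover index columns written by earlier saves
ngo_vks = ngo_vks.loc[:, ~ngo_vks.columns.str.startswith("Unnamed")]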

In [77]:
ngo_vks.head()

Unnamed: 0,vk,count_members,count_posts_all,count_posts_per_month,likes_overall_all,likes_mean_all,comments_overall_all,comments_mean_all,reposts_overall_all,reposts_mean_all,views_overall_all,views_mean_all,ads_overall_all,ads_mean_all,group_id
0,https://vk.com/ratris,3985.0,1707.0,6.67,230.0,11.5,14.0,0.7,230.0,11.5,7150.0,357.5,0.0,0.0,ratris
1,https://vk.com//public203415983,5.0,18.0,1.29,458.0,25.44,13.0,0.72,458.0,25.44,,,,,public203415983
2,https://vk.com//public203659500,54.0,18.0,1.29,458.0,25.44,13.0,0.72,458.0,25.44,,,,,public203659500
3,https://vk.com/15days_victory,146.0,106.0,2.5,4.0,0.2,0.0,0.0,4.0,0.2,,,0.0,0.0,15days_victory
4,https://vk.com/18blagodar,751.0,807.0,4.0,532.0,26.6,23.0,1.15,532.0,26.6,23910.0,1195.5,0.0,0.0,18blagodar


In [75]:
# add group_id column to make graphs prettier
group_names = []
for entry in ngo_vks.itertuples():
    group_id = re.sub('https://vk.com/', '', entry.vk).strip('/')
    group_names.append(group_id)

In [76]:
ngo_vks["group_id"] = group_names
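
The raw links come in several shapes (surrounding newlines, a missing scheme, a doubled slash before `public...`). A slightly more tolerant extraction could look like the sketch below. `extract_group_id` is a hypothetical helper, and the optional `www.` / `m.` prefixes are an assumption rather than something observed in the dataset.

```python
import re


def extract_group_id(raw_url):
    """Return the trailing screen name or id from a VK link."""
    cleaned = raw_url.strip()
    # Drop an optional scheme, an optional "//" and the vk.com host.
    cleaned = re.sub(r"^(?:https?:)?(?://)?(?:www\.|m\.)?vk\.com", "", cleaned)
    return cleaned.strip("/")


samples = [
    "\nhttps://vk.com/bsdipol_manager\n",
    "//vk.com/budros",
    "https://vk.com//public203415983",
]
group_ids = [extract_group_id(url) for url in samples]
print(group_ids)  # ['bsdipol_manager', 'budros', 'public203415983']
```

Query strings and deeper paths are not handled here; links of that kind would need an extra cleaning step.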

In [78]:
import plotly.express as px


fig = px.bar(ngo_vks[ngo_vks["count_members"] > 130000], x="group_id", y="count_members", title="VK groups with more than 130,000 members")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [79]:
fig = px.bar(ngo_vks[ngo_vks["count_posts_all"] > 40000], x="group_id", y="count_posts_all", title="VK groups with more than 40,000 posts")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [80]:
fig = px.bar(ngo_vks[ngo_vks["count_posts_per_month"] > 1], x="group_id", y="count_posts_per_month", title="VK groups with more than 1 post per month")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [82]:
fig = px.bar(ngo_vks[ngo_vks["likes_overall_all"] > 6000], x="group_id", y="likes_overall_all", title="Total likes on fetched posts (> 6,000)")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [87]:
fig = px.bar(ngo_vks[ngo_vks["likes_mean_all"] > 200], x="group_id", y="likes_mean_all", title="Mean likes per post (> 200)")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [88]:
fig = px.bar(ngo_vks[ngo_vks["comments_overall_all"] > 200], x="group_id", y="comments_overall_all", title="Total comments on fetched posts (> 200)")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [93]:
fig = px.bar(ngo_vks[ngo_vks["reposts_overall_all"] > 5000], x="group_id", y="reposts_overall_all", title="Total reposts on fetched posts (> 5,000)")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [98]:
fig = px.bar(ngo_vks[ngo_vks["reposts_mean_all"] > 200], x="group_id", y="reposts_mean_all", title="Mean reposts per post (> 200)")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [110]:
fig = px.bar(ngo_vks[ngo_vks["views_overall_all"] > 190000], x="group_id", y="views_overall_all", title="Total views on fetched posts (> 190,000)")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [113]:
fig = px.bar(ngo_vks[ngo_vks["views_mean_all"] > 20000], x="group_id", y="views_mean_all", title="Mean views per post (> 20,000)")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()

In [117]:
fig = px.bar(ngo_vks[ngo_vks["ads_overall_all"] > 0], x="group_id", y="ads_overall_all", title="Groups with fetched posts marked as ads")
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})
fig.show()