# TODO
1. move most of the .md text to the habr post, leave here only code comments

### Group2vec
**What?**: The notebook demostrates how to use subscriptions of users that we've crawled from social networks. We will train word2vec model, but instead of tokens will use groups

**Data used**: I use open data of users from VK social network (popular in Russia, Ukrain, etc). Data crawled using [Suvec VK crawl engine](https://github.com/ProtsenkoAI/skady-user-vectorizer), [Skady ward - crawl GUI](https://github.com/ProtsenkoAI/skady-ward), both instruments I have developed myself

**Why?** Because then we can get knowledge about groups and their users in social networks, similarly to how we analyze texts and their authors in NLP. For example, if you want to train RecSys that will work for new users of your service, you can apply group2vec to get some user features without interactions

**Why word2vec and not BERT?**: this is simple example of how you can use crawled data. Of course, BERT and other SOTA-closer methods can icrease metrics

### 1. Set things up: import packages, load data, set variables

In [111]:
import os
import json
from time import time
from typing import List, Union, Callable

import gensim
import vk_api

In [35]:
DATA_PATH = "./data"

In [4]:
parsed_pth = os.path.join(DATA_PATH, "parsed_data.json")

with open(parsed_pth) as f:
    users_data = json.load(f)

### 2. Watch in data

In [14]:
print(f"We have {len(users_data)} users (text analog for NLP)")

nb_of_groups = 0
for user_id, user_data in users_data.items():
    nb_of_groups += len(user_data["groups"])
    
print(f"They subscribed to {nb_of_groups} groups (token analog for NLP)")

We have 11767 users (text analog for NLP)
They subscribed to 1703395 groups (token analog for NLP)


In [24]:
some_user_id = next(iter(users_data.keys()))
some_user_data = users_data[some_user_id]

print("Each user has data about:", " and ".join(some_user_data.keys()))
print("'Friends' is list of other users ids: ", some_user_data["friends"][:5])
print("'Groups' is list of groups ids: ", some_user_data["groups"][:5])

Each user has data about: friends and groups
'Friends' is list of other users ids:  [10648769, 69462006, 133486963, 140577913, 143059696]
'Groups' is list of groups ids:  [75065732, 74938476, 78426877, 79562569, 81212949]


#### Intuition of group2vec
Herinafter we treat groups as "tokens" and "users" as documents.

**The idea of word2vec is**: if 2 tokens are met in similar contexts, their meaning is similar. 

**The idea of group2vec is**: if 2 groups are met in subscriptions of similar users (with a lot of common groups) this groups are similar.

**Then** as word2vec appliers make text embedding averaging words embeddings, we make users embeddings averaging groups' ones


### 3. Prepare data for word2vec

In [27]:
corpus = []
for user_data in users_data.values():
    # make strings because of gensim requirements
    document = [str(group) for group in user_data["groups"]]
    corpus.append(document)

Note: window of w2v is very large because groups don't have order, unlike words in text. 
Thus when model predicts a group, it can get information about any other group in user subscriptions

In [46]:
w2v_model = gensim.models.Word2Vec(min_count=20,
                                 window=100,
                                 vector_size=300,
                                 sample=6e-5, # downsampling popular groups
                                 alpha=0.03, 
                                 min_alpha=0.0007, 
                                 negative=20,
                                 workers=3)

In [47]:
w2v_model.build_vocab(corpus)

In [48]:
t = time()

w2v_model.train(corpus, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 4.58 mins


### 4. Test trained model

In [None]:
# authorizing in vk to get group names by ids
session = vk_api.VkApi(token=input("Enter your access token for vk\n"))

In [62]:
def get_groups_names(group_ids: List[str]):
    group_ids_encoded = ",".join(group_ids)
    resp = session.method("groups.getById", values={"group_ids": group_ids_encoded, "fields": "name"})
    
    names = [group["name"] for group in resp]
    return names

In [118]:
def apply_wv_method_print_res(group_id_or_ids: Union[str, List[str]], wv_method: Callable):
    if isinstance(group_id_or_ids, str):
        group_ids = [group_id_or_ids]
    else:
        group_ids = group_id_or_ids
        
    model_preds = wv_method(group_ids)
    similar_groups_ids = [group_id for group_id, sim_score in model_preds]
    
    groups_names = get_groups_names(group_ids + similar_groups_ids)
    
    src_group_name = groups_names[:len(group_ids)]
    group_names = groups_names[len(group_id_or_ids):]
    print(f"Similar groups for groups {src_group_name}:")
    print("\n".join(groups_names))
    
def find_similar(group_id_or_ids, model: gensim.models.Word2Vec):
    return apply_wv_method_print_res(group_id_or_ids, model.wv.most_similar)

def find_one_out(group_ids, model: gensim.models.Word2Vec):
    return apply_wv_method_print_res(group_ids, model.wv.doesnt_match)

In [120]:
# TODO: split wv_method to many methods
find_similar("109125388", model=w2v_model)
find_one_out(["109125388", "128176420", "72495085"], model=w2v_model)

Similar groups for groups ['godnotent']:
godnotent
киберпанк, который мы заслужили
абстрактные мемы для элиты всех сортов | АМДЭВС
Swipe Right
с каждым днем все радостнее жить
романтика городских окраин
Даркнет, который мы заслужили
вsратые животные
ресунки
Физика для ебанов
$$$ DANK MEMES $$$ AYY LMAO $$$


ValueError: not enough values to unpack (expected 2, got 1)

### TODO: TSNE of clusters, list of groups, several clusters