# Practical assignment "Psychometrics of social network users"

Completed by Oleg Demin, J42111, 335428

## Assignment

Using the VK API or a scraper, collect data for 10 users:
1. Wall posts, going back through the full history
2. Build emotional profiles of the users:
  - Use the emotion detection model for Russian-language texts: https://huggingface.co/cointegrated/rubert-tiny2-cedr-emotion-detection
  - Build static profiles for the users by averaging their emotional digital footprints, thereby identifying the dominant emotions
  - Build dynamic profiles for the users by averaging their emotional digital footprints separately for each year, identifying the dominant emotions per year
  - Compile psychometric profiles of the users:
    - Static: over all posts
    - Dynamic: over yearly intervals

## Procedure

### Libraries and global constants

In [3]:
from pathlib import Path

import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go

from plotly.subplots import make_subplots
from transformers import BertForSequenceClassification, AutoTokenizer

In [4]:
DATA_DIR = Path('./data/psycho')

### Data

For the selected users, the script parsing.py collected their profile information and their posts.
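
parsing.py itself is not shown in this notebook, so the following is only a sketch of how a wall could be paged through to its full depth. It assumes the VK API method `wall.get` with its `owner_id`, `offset` and `count` parameters and a limit of 100 posts per call; the helper names `fetch_wall_page` and `fetch_all_posts`, the API version string and the access token are illustrative assumptions, not the author's code.

```python
import json
import urllib.parse
import urllib.request

API_URL = 'https://api.vk.com/method/wall.get'
PAGE_SIZE = 100  # assumed per-call limit of wall.get


def fetch_wall_page(owner_id, offset, token, version='5.131'):
    """Request one page of wall posts (hypothetical helper, needs a valid token)."""
    params = urllib.parse.urlencode({
        'owner_id': owner_id,
        'offset': offset,
        'count': PAGE_SIZE,
        'access_token': token,
        'v': version,
    })
    with urllib.request.urlopen(f'{API_URL}?{params}') as resp:
        return json.load(resp)['response']['items']


def fetch_all_posts(fetch_page):
    """Call fetch_page(offset) repeatedly until an empty page comes back."""
    posts, offset = [], 0
    while True:
        items = fetch_page(offset)
        if not items:
            return posts
        posts.extend(items)
        offset += len(items)
```

With a real token, a full wall could then be requested as `fetch_all_posts(lambda off: fetch_wall_page(90015688, off, token))`. Passing the page fetcher in as a callable keeps the paging logic testable without network access.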

In [9]:
users = pd.read_csv(DATA_DIR / 'users.csv')

users.fillna('', inplace=True)

users

Unnamed: 0,id,domain,first_name,last_name,status,about
0,90015688,geneza1,Святослав,Стародуб,Ain' t no rest for the wicked,
1,339819489,biokita,Никита,Кузнецов,великие знания порождают великую скорбь,Я программист
2,212598813,n7weirdo,Анастасия,Петрушкина,Something’s changed in your face,
3,204738925,mamonov5,Максим,Некрылов,Ответы?,
4,89644563,0b1t_322,Даниил,Демин,Тебя что в гугле забанили,
5,295508700,oldwood20,Иван,Никитин,,
6,451075313,zodiak1990,Владислав,Гедз,Без любви наше сердце лишь часы,
7,173920498,getnocloser,Радж,Полянцев,last dance,
8,213930403,gabchanskiu,Глеб,Игумнов,,
9,157971403,ukropmolodoi,David,Gvasalia,Вуконг Прайм,


In [5]:
posts = pd.read_csv(DATA_DIR / 'posts.csv')

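# Assign back instead of using inplace on a column selection, which may not
# modify `posts` in recent pandas versions.
posts['text'] = posts['text'].fillna('')
posts['reposted_text'] = posts['reposted_text'].fillna('')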

posts['dt'] = pd.to_datetime(posts['timestamp'], unit='s')

posts

Unnamed: 0,id,text,from_id,owner_id,timestamp,comment_count,reposted_text,dt
0,3273,,90015688,90015688,1599489222,0,"Одни поклонялись солнцу, другие боготворили лу...",2020-09-07 14:33:42
1,3272,,90015688,90015688,1592246372,0,Летние #ЕжедневныеНаградыLOL уже ждут вас в кл...,2020-06-15 18:39:32
2,3271,,90015688,90015688,1581675983,0,Рады представить новый турнир КОРОЛЕВСКАЯ БИТВА 👑,2020-02-14 10:26:23
3,3270,,90015688,90015688,1574024845,0,"⚠Хей, у нас НОВОСТИ!!! \nПривет путник, мы при...",2019-11-17 21:07:25
4,3269,,90015688,90015688,1573895758,0,#Hearthstone Конкурс! 💥 \nПолучи Пакет «Натиск...,2019-11-16 09:15:58
...,...,...,...,...,...,...,...,...
2225,9,"Арр! Я получил 1 уровень в игре ""Вампиры"". Кто...",90015688,90015688,1288255976,0,,2010-10-28 08:52:56
2226,8,"Теперь я - ""Избранник Крови I уровня"" в игре ""...",90015688,90015688,1288255920,0,,2010-10-28 08:52:00
2227,3,я Вагнер Лав хаха,90015688,90015688,1288115539,0,,2010-10-26 17:52:19
2228,2,"Привет, Святослав! Я узнал, Какой ты футболист...",89826923,90015688,1286210217,0,,2010-10-04 16:36:57
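
The `dt` column is built from Unix timestamps, so its values are timezone-naive UTC times. For posts made close to New Year, the year used later for the dynamic profiles can therefore differ from the author's local year. A small sketch of the conversion, using the timestamp of the first post in the table above; the fixed UTC+3 offset is an assumption made purely for illustration.

```python
from datetime import datetime, timedelta, timezone

ts = 1599489222  # timestamp of the first post in the table above

# Interpret the timestamp as seconds since the Unix epoch, in UTC.
utc_time = datetime.fromtimestamp(ts, tz=timezone.utc)

# Shift to a fixed UTC+3 offset (assumed local time zone, illustration only).
local_time = utc_time.astimezone(timezone(timedelta(hours=3)))

print(utc_time.strftime('%Y-%m-%d %H:%M:%S'))
print(local_time.strftime('%Y-%m-%d %H:%M:%S'))
```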


Distribution of the number of posts across users

In [7]:
posts.owner_id.value_counts()

90015688     1896
339819489     101
212598813      77
451075313      52
173920498      23
295508700      19
213930403      17
157971403      16
89644563       13
204738925       9
157938870       7
Name: owner_id, dtype: int64
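
The counts above list 11 distinct `owner_id` values, while the assignment calls for 10 users. A minimal sketch of a consistency check between the two tables; the ids are copied from the listings shown in this notebook, which may be truncated, so the result only illustrates the technique.

```python
import pandas as pd

# Ids as they appear in the users listing above.
user_ids = pd.Series([
    90015688, 339819489, 212598813, 204738925, 89644563,
    295508700, 451075313, 173920498, 213930403, 157971403,
])

# Distinct owner_id values from the value_counts() output above.
owner_ids = pd.Series([
    90015688, 339819489, 212598813, 451075313, 173920498, 295508700,
    213930403, 157971403, 89644563, 204738925, 157938870,
])

# Wall owners that have no matching row among the listed users.
unknown_owners = owner_ids[~owner_ids.isin(user_ids)]
print(unknown_owners.tolist())
```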

### Extracting emotions from user texts

The model **cointegrated/rubert-tiny2-cedr-emotion-detection** is used for the extraction.

The model detects the following emotions:
 - no emotion
 - joy
 - sadness
 - surprise
 - fear
 - anger

First, we extract the emotions from the texts of the users' posts.

In [10]:
tokenizer = AutoTokenizer.from_pretrained('cointegrated/rubert-tiny2-cedr-emotion-detection')
model = BertForSequenceClassification.from_pretrained('cointegrated/rubert-tiny2-cedr-emotion-detection')

In [11]:
labels = list(model.config.id2label.values())

labels

['no_emotion', 'joy', 'sadness', 'surprise', 'fear', 'anger']

In [12]:
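import torch  # needed for torch.no_grad() below


def text_to_emotions_probs(text):
    """Return a (1, 6) array of per-label probabilities for one text."""
    # Empty texts (e.g. pure reposts) get an all-zero row; these rows still
    # take part in the averages computed later.
    if not text:
        return np.zeros((1, 6))

    enc = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        out = model(**enc)
    logits = out.logits.numpy()
    # Logistic sigmoid written through odds: each label is scored
    # independently, so the six values need not sum to 1.
    odds = np.exp(logits)
    return odds / (1 + odds)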
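
The function above turns each logit x into a probability with the logistic sigmoid, written as exp(x) / (1 + exp(x)). For moderate logits this is fine, but for very large values `np.exp` overflows and the ratio becomes `nan`. The following sketch compares that form with a numerically stable variant; `stable_sigmoid` is an illustrative helper, not part of the original notebook.

```python
import numpy as np


def sigmoid_via_odds(logits):
    """Sigmoid in the form used above: odds / (1 + odds)."""
    odds = np.exp(logits)
    return odds / (1 + odds)


def stable_sigmoid(logits):
    """Same function, but exp() only ever sees non-positive inputs."""
    logits = np.asarray(logits, dtype=float)
    e = np.exp(-np.abs(logits))  # in (0, 1], cannot overflow
    return np.where(logits >= 0, 1 / (1 + e), e / (1 + e))


x = np.array([-2.0, 0.0, 3.0])
print(sigmoid_via_odds(x))
print(stable_sigmoid(x))
```

Both functions agree on ordinary inputs; the stable variant additionally returns 1.0 and 0.0 for extreme positive and negative logits instead of `nan`.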

In [13]:
posts_texts = posts.text.tolist()

posts_texts_emotions = [text_to_emotions_probs(text) for text in posts_texts]
posts_texts_emotions = np.concatenate(posts_texts_emotions, axis=0)

texts_emotions = pd.DataFrame(posts_texts_emotions, columns=labels)

texts_emotions


Unnamed: 0,no_emotion,joy,sadness,surprise,fear,anger
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...
2225,0.062843,0.022415,0.002596,0.749405,0.063894,0.097042
2226,0.572088,0.008883,0.001178,0.158655,0.019917,0.031954
2227,0.974816,0.013609,0.002528,0.004640,0.003987,0.007585
2228,0.016983,0.889435,0.003846,0.389407,0.026945,0.014734


In [14]:
posts = pd.concat([posts, texts_emotions], axis=1)

posts

Unnamed: 0,id,text,from_id,owner_id,timestamp,comment_count,reposted_text,dt,no_emotion,joy,sadness,surprise,fear,anger
0,3273,,90015688,90015688,1599489222,0,"Одни поклонялись солнцу, другие боготворили лу...",2020-09-07 14:33:42,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,3272,,90015688,90015688,1592246372,0,Летние #ЕжедневныеНаградыLOL уже ждут вас в кл...,2020-06-15 18:39:32,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,3271,,90015688,90015688,1581675983,0,Рады представить новый турнир КОРОЛЕВСКАЯ БИТВА 👑,2020-02-14 10:26:23,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,3270,,90015688,90015688,1574024845,0,"⚠Хей, у нас НОВОСТИ!!! \nПривет путник, мы при...",2019-11-17 21:07:25,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,3269,,90015688,90015688,1573895758,0,#Hearthstone Конкурс! 💥 \nПолучи Пакет «Натиск...,2019-11-16 09:15:58,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2225,9,"Арр! Я получил 1 уровень в игре ""Вампиры"". Кто...",90015688,90015688,1288255976,0,,2010-10-28 08:52:56,0.062843,0.022415,0.002596,0.749405,0.063894,0.097042
2226,8,"Теперь я - ""Избранник Крови I уровня"" в игре ""...",90015688,90015688,1288255920,0,,2010-10-28 08:52:00,0.572088,0.008883,0.001178,0.158655,0.019917,0.031954
2227,3,я Вагнер Лав хаха,90015688,90015688,1288115539,0,,2010-10-26 17:52:19,0.974816,0.013609,0.002528,0.004640,0.003987,0.007585
2228,2,"Привет, Святослав! Я узнал, Какой ты футболист...",89826923,90015688,1286210217,0,,2010-10-04 16:36:57,0.016983,0.889435,0.003846,0.389407,0.026945,0.014734


### Static user profiles

In [15]:
user_emotions = posts[['owner_id', *labels]].groupby('owner_id').mean().reset_index()

user_emotions

Unnamed: 0,owner_id,no_emotion,joy,sadness,surprise,fear,anger
0,89644563,0.030014,0.068093,0.020134,0.021619,0.011241,0.052356
1,90015688,0.050604,0.070144,0.03456,0.094034,0.012539,0.034427
2,157938870,0.574068,0.303918,0.004185,0.03361,0.011059,0.160649
3,157971403,0.180897,0.015992,0.023384,0.094342,0.025262,0.068464
4,173920498,0.391641,0.161572,0.052305,0.036861,0.01336,0.031061
5,204738925,0.242719,0.24204,0.091838,0.024137,0.019925,0.015613
6,212598813,0.169518,0.161991,0.012731,0.055914,0.029641,0.059835
7,213930403,0.025473,0.810044,0.013615,0.072344,0.011205,0.017987
8,295508700,0.093885,0.090711,0.004,0.052107,0.027095,0.096335
9,339819489,0.113289,0.041597,0.002436,0.006952,0.014707,0.028233
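
The assignment asks for the dominant emotion of each user, which can be read off the averaged table with `idxmax`. A minimal sketch on three rows copied from the table above; whether `no_emotion` should count as a candidate dominant is a judgment call, so both variants are shown.

```python
import pandas as pd

labels = ['no_emotion', 'joy', 'sadness', 'surprise', 'fear', 'anger']

# Three rows copied from the user_emotions table above.
sample = pd.DataFrame(
    [
        [90015688, 0.050604, 0.070144, 0.034560, 0.094034, 0.012539, 0.034427],
        [157938870, 0.574068, 0.303918, 0.004185, 0.033610, 0.011059, 0.160649],
        [213930403, 0.025473, 0.810044, 0.013615, 0.072344, 0.011205, 0.017987],
    ],
    columns=['owner_id', *labels],
).set_index('owner_id')

# Label with the highest average probability, per user.
dominant = sample[labels].idxmax(axis=1)

# Same, but ignoring the "no_emotion" label.
dominant_emotion_only = sample[labels[1:]].idxmax(axis=1)

print(dominant)
print(dominant_emotion_only)
```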


In [142]:
user_emotions_for_rose = []
for row in user_emotions.itertuples():
    for emotion_name in labels:
        user_emotions_for_rose.append((getattr(row, 'owner_id'), emotion_name, getattr(row, emotion_name)))
    
user_emotions_for_rose = pd.DataFrame(user_emotions_for_rose, columns=['user_id', 'emotion', 'value'])

user_emotions_for_rose

Unnamed: 0,user_id,emotion,value
0,89644563,no_emotion,0.030014
1,89644563,joy,0.068093
2,89644563,sadness,0.020134
3,89644563,surprise,0.021619
4,89644563,fear,0.011241
...,...,...,...
61,451075313,joy,0.012969
62,451075313,sadness,0.000634
63,451075313,surprise,0.014919
64,451075313,fear,0.001057
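
The loop above reshapes the wide table into a long one with one row per user and emotion. As a more compact alternative, the same reshaping can be sketched with `DataFrame.melt`; the two rows below are copied from the `user_emotions` table, and the stable sort restores the user-by-user row order that the loop produces.

```python
import pandas as pd

labels = ['no_emotion', 'joy', 'sadness', 'surprise', 'fear', 'anger']

# Two rows copied from the user_emotions table above.
wide = pd.DataFrame(
    [
        [89644563, 0.030014, 0.068093, 0.020134, 0.021619, 0.011241, 0.052356],
        [90015688, 0.050604, 0.070144, 0.034560, 0.094034, 0.012539, 0.034427],
    ],
    columns=['owner_id', *labels],
)

long_df = (
    wide
    .melt(id_vars='owner_id', value_vars=labels,
          var_name='emotion', value_name='value')
    .rename(columns={'owner_id': 'user_id'})
    # melt orders rows emotion by emotion; a stable sort groups them by
    # user while keeping the emotions in their original order.
    .sort_values('user_id', kind='stable')
    .reset_index(drop=True)
)

print(long_df)
```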


Now let us build an interactive emotion rose for the users.

Because the average probability of each emotion turned out to be small, logarithmic scales are used for the visualization.

To look at a single user, click on user IDs in the legend to turn off the display of the others.

In [153]:
fig = px.line_polar(
    user_emotions_for_rose, 
    r='value', 
    log_r=True,
    theta='emotion', 
    color='user_id', 
    title='Static user profiles',
    template='plotly_dark',
    width=800,
    height=800,
)

fig.show()
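
One reason the averages are small is that posts with empty text are stored as all-zero rows and still enter the mean. The sketch below contrasts that behaviour with averaging over non-empty posts only; the toy values are illustrative and not taken from posts.csv, and restricting the mean in this way is a suggested alternative rather than what the notebook does.

```python
import pandas as pd

emotions = ['joy', 'sadness']

# Toy data (illustrative): user 1 has one scored post and one empty post,
# user 2 has a single scored post.
toy_posts = pd.DataFrame({
    'owner_id': [1, 1, 2],
    'text': ['some text', '', 'another text'],
    'joy': [0.8, 0.0, 0.1],
    'sadness': [0.2, 0.0, 0.6],
})

# Behaviour of the notebook: the zero row halves user 1's averages.
mean_all = toy_posts.groupby('owner_id')[emotions].mean()

# Alternative: drop posts without text before averaging.
has_text = toy_posts['text'].str.strip() != ''
mean_non_empty = toy_posts[has_text].groupby('owner_id')[emotions].mean()

print(mean_all)
print(mean_non_empty)
```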

### Dynamic user profiles

Now let us look at the users' psychometric profiles over time, year by year.

In [175]:
posts['year'] = posts.dt.dt.year

user_emotions_by_year = posts[['owner_id', 'year', *labels]].groupby(['owner_id', 'year']).mean().reset_index()

user_emotions_by_year_for_graph = []
for row in user_emotions_by_year.itertuples():
    for emotion_name in labels:
        user_emotions_by_year_for_graph.append((getattr(row, 'owner_id'), getattr(row, 'year'), emotion_name, getattr(row, emotion_name)))
    
user_emotions_by_year_for_graph = pd.DataFrame(user_emotions_by_year_for_graph, columns=['user_id', 'year', 'emotion', 'value'])

user_emotions_by_year_for_graph.sort_values(by='year', inplace=True)

user_emotions_by_year_for_graph

Unnamed: 0,user_id,year,emotion,value
0,89644563,2016,no_emotion,0.005246
1,89644563,2016,joy,0.018107
2,89644563,2016,sadness,0.207682
3,89644563,2016,surprise,0.096601
4,89644563,2016,fear,0.030258
...,...,...,...,...
355,451075313,2020,joy,0.000000
356,451075313,2020,sadness,0.000000
357,451075313,2020,surprise,0.000000
358,451075313,2020,fear,0.000000
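
The yearly dominant emotion can be derived from the same grouping. Since some users have only a handful of posts, it seems useful to keep the number of posts per year next to the dominant label. A minimal sketch on toy data: the timestamps are taken from the posts table above, while the probabilities and the user id are illustrative.

```python
import pandas as pd

emotions = ['joy', 'sadness']

# Toy posts: one user, two posts in 2010 and one in 2020.
toy = pd.DataFrame({
    'owner_id': [1, 1, 1],
    'timestamp': [1288255976, 1288115539, 1599489222],
    'joy': [0.1, 0.3, 0.7],
    'sadness': [0.6, 0.4, 0.2],
})
toy['year'] = pd.to_datetime(toy['timestamp'], unit='s').dt.year

by_user_year = toy.groupby(['owner_id', 'year'])

# Average per user and year, then pick the label with the highest average.
yearly_mean = by_user_year[emotions].mean()
summary = pd.DataFrame({
    'dominant': yearly_mean.idxmax(axis=1),
    'n_posts': by_user_year.size(),
})

print(summary)
```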


In [210]:
fig = px.line(
    user_emotions_by_year_for_graph, 
    x='year', 
    y='value', 
    log_y=True,
    markers=True,
    color='user_id', 
    facet_col_wrap=1,
    facet_col='emotion',
)

fig.update_layout(
    height=1600, width=1400,
    title_text='Dynamic user profiles by year',
)

fig.show()