# Data Processing

In this Jupyter notebook, we process the dataset `Data_SinaWeibo.csv` obtained by the crawler script. The processing includes the steps listed below:
- Preliminary processing to normalize the data types
- Building a user feature database
- Basic processing of the textual data
- Reorganizing the dataset into the form required for heterogeneous graph construction

We provide a detailed explanation for each code cell.

# 0. Load the Python libraries and the crawled data

In [1]:
'''Basic'''
import os
import time
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime
import re
import emoji
import multiprocessing
import plotly.express as px

from kmodes.kprototypes import KPrototypes

from lightgbm import LGBMClassifier
import shap
from sklearn.model_selection import cross_val_score
from bs4 import BeautifulSoup

'''sklearn'''
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

'''NLP: Hugging face'''
from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import BertTokenizer, BertModel
# from transformers import AutoModelForSequenceClassification
# from torch import nn

import torch
torch.manual_seed(0)
np.random.seed(0)

'''Hetero graph'''
from torch_geometric.data import HeteroData

'''Training the dataset'''
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# from snownlp import SnowNLP

In [2]:
# Check the transformers version
import transformers
transformers.__version__

'4.31.0'

In [3]:
# Set the seaborn visualization parameters
sns.set(style="darkgrid")
sns.set_context("notebook",
                rc={"xtick.labelsize": 14,
                    "ytick.labelsize": 14,
                    "axes.labelsize": 14,
                    "axes.titlesize": 18,
                    "legend.fontsize": 14,
                    "legend.title_fontsize":14})

In [4]:
# Read the raw dataset (change the path if necessary)
raw_data=pd.read_csv('Datasets/Data_SinaWeibo.csv')
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   CommentID        3850 non-null   int64 
 1   CommentTime      3850 non-null   object
 2   RootID           3850 non-null   int64 
 3   CommentRaw       3850 non-null   object
 4   Comment          3829 non-null   object
 5   CommentLike      3850 non-null   int64 
 6   CommentReply     3850 non-null   object
 7   UserID           3850 non-null   int64 
 8   UserName         3850 non-null   object
 9   UserLocation     3850 non-null   object
 10  UserDescription  2366 non-null   object
 11  UserGender       3850 non-null   object
 12  UserFan          3850 non-null   int64 
 13  UserFollow       3850 non-null   int64 
 14  UserWeibo        3850 non-null   int64 
 15  UserVerified     3850 non-null   bool  
 16  UserFanTitle     3850 non-null   object
 17  VipRank          3850 non-null   

In [5]:
# Work on a copy so that raw_data stays untouched
data=raw_data.copy()

# 1. Preliminary data exploration and processing

There are 2803 users and 3850 comments/replies in total.<br>
To explore user-related features (user profiles) and their distributions and patterns later on, a user database needs to be extracted from the raw dataset ``data``.

In [6]:
len(data['UserName'].value_counts())

2803

In [7]:
len(data['UserID'].value_counts())

2803

Encode each feature in turn (where necessary).

In [8]:
data.head(20)

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,Comment,CommentLike,CommentReply,UserID,UserName,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,UserFanTitle,VipRank
0,4915861950828535,Fri Jun 23 18:31:24 +0800 2023,0,考古,考古,0,0,7752387333,祎只狸猫,其他,橘圈小透明阮淼祎,f,45,149,260,False,loyal_fans,0
1,4910942133158267,Sat Jun 10 04:41:48 +0800 2023,0,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,0,0,7604058796,邮一棵草莓i,其他,垃圾滚！！！,f,2,48,124,False,Null,0
2,4862811883439433,Sat Jan 28 09:09:22 +0800 2023,0,真不知道迅哥给评论区投了多少米？[允悲][doge],真不知道迅哥给评论区投了多少米？,0,0,7577283965,kmimg7,甘肃 庆阳,,m,0,31,4,False,Null,0
3,4862705678417948,Sat Jan 28 02:07:20 +0800 2023,0,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗,0,0,7476902376,小祀弟弟吖,其他,,f,0,39,2,False,Null,0
4,4858911816944130,Tue Jan 17 14:51:54 +0800 2023,0,[吃瓜]现在是2023年 回来考古的点个赞,现在是2023年 回来考古的点个赞,0,0,7724649649,不知道如何评价,其他,,m,0,55,0,False,Null,0
5,4857585939256244,Fri Jan 13 23:03:20 +0800 2023,0,考古,考古,0,0,6317060050,Tawil-at_Umr,其他,,f,1,71,27,False,Null,0
6,4843708346798597,Tue Dec 06 15:58:44 +0800 2022,0,就搞不懂了，怎么有些人这么愤世嫉俗[doge]那他平日得有多大成就啊[泪],就搞不懂了，怎么有些人这么愤世嫉俗那他平日得有多大成就啊,0,0,5238123879,我原来就是小黑子,山东 烟台,他只看见外表的结果，而这种结果却已给他很深的印象了。,m,10,67,299,False,loyal_fans,0
7,4842368296289558,Fri Dec 02 23:13:51 +0800 2022,0,考古结束@Akoi,考古结束@Akoi,0,0,7752939212,花月歌浮舟,其他,,f,60,122,1,False,Null,0
8,4836112617182567,Tue Nov 15 16:56:01 +0800 2022,0,Windows 11“狗都不用”时代前来考古[doge],Windows 11“狗都不用”时代前来考古,0,0,5952984271,罗小黑本喵_,浙江,A cat spirit who's Windows Insider. 现已镇魂。,m,186,1642,12478,True,loyal_fans,6
9,4832608544361987,Sun Nov 06 00:52:05 +0800 2022,0,两年了，回过头来看[打call][打call][打call][打call],两年了，回过头来看,9,1,7776168152,养着兔子的猫咪,上海,,f,0,61,0,False,Null,0


## 1.1 Check for tab characters in CommentID and UserID

The tab characters introduced by the crawling script are gone, so these two variables need no additional processing.

In [9]:
type(data['CommentID'][0])

numpy.int64

In [10]:
data['CommentID'][0]

4915861950828535

In [11]:
type(data['UserID'][0])

numpy.int64

In [12]:
data['UserID'][0]

7752387333
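
If the IDs had instead been read as strings with trailing tabs, a cleanup along the following lines could be used. This is only a sketch on made-up sample values; `raw_ids` and `clean_ids` are hypothetical names and not part of the notebook.

```python
import pandas as pd

# Hypothetical raw IDs as a crawler might have written them, with trailing tabs.
raw_ids = pd.Series(["4915861950828535\t", "4910942133158267\t"])

# Detect whether any value still carries a tab character.
has_tab = raw_ids.str.contains("\t", regex=False).any()

# Strip surrounding whitespace (tabs included) and convert to 64-bit integers.
clean_ids = raw_ids.str.strip().astype("int64")
print(has_tab, clean_ids.tolist())
```

Since the columns in this dataset are already `int64`, no such step is needed here.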

## 1.2 Format and data type of CommentTime

CommentTime is currently stored as a long `str`. Convert it to the ``datetime`` data type.

In [13]:
data['CommentTime']

0       Fri Jun 23 18:31:24 +0800 2023
1       Sat Jun 10 04:41:48 +0800 2023
2       Sat Jan 28 09:09:22 +0800 2023
3       Sat Jan 28 02:07:20 +0800 2023
4       Tue Jan 17 14:51:54 +0800 2023
                     ...              
3845    Sat Jun 08 12:02:17 +0800 2019
3846    Sat Jun 08 12:02:15 +0800 2019
3847    Sat Jun 08 12:01:51 +0800 2019
3848    Sat Jun 08 12:01:18 +0800 2019
3849    Sat Jun 08 12:00:47 +0800 2019
Name: CommentTime, Length: 3850, dtype: object

Use a lambda function to convert the whole ``CommentTime`` column.

In [14]:
# Format string that matches the raw timestamp strings
format_string = "%a %b %d %H:%M:%S %z %Y"
data['CommentTime']=data['CommentTime'].apply(lambda x: datetime.strptime(x, format_string))
data['CommentTime']

0      2023-06-23 18:31:24+08:00
1      2023-06-10 04:41:48+08:00
2      2023-01-28 09:09:22+08:00
3      2023-01-28 02:07:20+08:00
4      2023-01-17 14:51:54+08:00
                  ...           
3845   2019-06-08 12:02:17+08:00
3846   2019-06-08 12:02:15+08:00
3847   2019-06-08 12:01:51+08:00
3848   2019-06-08 12:01:18+08:00
3849   2019-06-08 12:00:47+08:00
Name: CommentTime, Length: 3850, dtype: datetime64[ns, UTC+08:00]
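
As an alternative to the row-wise `apply`, the same format string can be passed to `pd.to_datetime`, which parses the whole column at once. The sketch below uses two timestamps copied from the output above; `sample`, `parsed`, and `offset_hours` are illustrative names only.

```python
import pandas as pd

# Two raw timestamps in the same layout as the CommentTime column.
sample = pd.Series([
    "Fri Jun 23 18:31:24 +0800 2023",
    "Sat Jun 08 12:00:47 +0800 2019",
])

format_string = "%a %b %d %H:%M:%S %z %Y"

# Parse the whole Series in one call; %z keeps the +08:00 offset.
parsed = pd.to_datetime(sample, format=format_string)

# Offset of the first timestamp, in hours.
offset_hours = parsed.iloc[0].utcoffset().total_seconds() / 3600
print(parsed.dt.year.tolist(), offset_hours)
```

Both approaches should give timezone-aware values; the vectorized call mainly saves time on larger datasets.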

## 1.3 The data type of CommentReply is object

Records whose CommentReply value is ``Null`` are "replies", i.e., responses to a comment left under the target post (Genshin). We use "reply" for this type of record from here on.<br>
The check below confirms that no reply is missing its RootID. As noted in Sections 1.8 and 2.2, the RootID of a reply is non-zero, while a top-level comment has RootID 0.

In [15]:
data['CommentReply'].value_counts().sort_index()

CommentReply
0       2461
1         79
10         3
108        1
11         2
111        1
12         2
123        1
14         1
15         1
17         1
2         41
20         1
207        1
21         2
24         1
26         1
29         1
3         25
30         1
31         1
325        1
4         10
40         1
42         1
45         1
5          1
53         1
6          7
7          7
75         1
8          2
9          3
Null    1186
Name: count, dtype: int64

In [16]:
data[data['CommentReply']=='Null']['RootID'].isnull().sum()

0
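
The check above only counts missing RootID values. A stricter check, sketched here on a small made-up frame, would test directly that replies have a non-zero RootID and top-level comments have RootID 0. The frame `toy` and both result names are assumptions for illustration.

```python
import pandas as pd

# Toy data: two top-level comments and two replies to a comment with ID 111.
toy = pd.DataFrame({
    "CommentReply": ["0", "3", "Null", "Null"],
    "RootID": [0, 0, 111, 111],
})

is_reply = toy["CommentReply"] == "Null"

# Replies should point to a root comment; top-level comments should not.
replies_have_root = (toy.loc[is_reply, "RootID"] != 0).all()
comments_are_roots = (toy.loc[~is_reply, "RootID"] == 0).all()
print(replies_have_root, comments_are_roots)
```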

We need to convert the numeric strings to int and turn the ``Null`` string into a real missing value (``pd.NA``).

In [17]:
data['CommentReply']=data['CommentReply'].replace('Null',pd.NA)
data['CommentReply'].isnull().sum()

1186

In [18]:
data['CommentReply']=data['CommentReply'].astype('Int64')
data['CommentReply'].dtypes

Int64Dtype()

In [19]:
data['CommentReply'].isnull().sum()

1186
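
Because the raw values are strings, `sort_index()` in the earlier cell orders them lexicographically (`'10'` before `'2'`). The sketch below shows that effect on made-up values, together with a one-step conversion via `pd.to_numeric`. Note that `errors="coerce"` turns every non-numeric string into a missing value, not only `'Null'`, so it assumes `'Null'` is the only such string.

```python
import pandas as pd

raw = pd.Series(["0", "3", "Null", "10", "Null", "2"])

# Sorting strings is lexicographic: '10' comes before '2'.
print(sorted(raw[raw != "Null"]))

# Coerce non-numeric strings to NaN, then use the nullable Int64 dtype.
replies = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(replies.dropna().sort_values().tolist(), int(replies.isna().sum()))
```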

In [20]:
# Done
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   CommentID        3850 non-null   int64                    
 1   CommentTime      3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID           3850 non-null   int64                    
 3   CommentRaw       3850 non-null   object                   
 4   Comment          3829 non-null   object                   
 5   CommentLike      3850 non-null   int64                    
 6   CommentReply     2664 non-null   Int64                    
 7   UserID           3850 non-null   int64                    
 8   UserName         3850 non-null   object                   
 9   UserLocation     3850 non-null   object                   
 10  UserDescription  2366 non-null   object                   
 11  UserGender       3850 non-null   object                 

For double-checking only.

In [21]:
data['CommentReply'].isnull().sum()

1186

In [22]:
data['CommentReply'][0]

0

## 1.4 UserLocation: separate the province and the city/area

The values of UserLocation are not uniform. Split the province and the city/area and save them as two new columns.

In [23]:
data['UserLocation'].value_counts()

UserLocation
其他         1062
上海          130
海外          120
江苏          102
广东 广州        96
           ... 
四川 眉山         1
河南 焦作         1
上海 松江区        1
内蒙古 赤峰        1
海外 爱沙尼亚       1
Name: count, Length: 298, dtype: int64

The original UserLocation column is kept as well.

In [24]:
data.insert(9,'Province',data['UserLocation'].str.split(' ', n=1, expand=True).iloc[:,0])
data.insert(10,'Region',data['UserLocation'].str.split(' ', n=1, expand=True).iloc[:,1])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   CommentID        3850 non-null   int64                    
 1   CommentTime      3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID           3850 non-null   int64                    
 3   CommentRaw       3850 non-null   object                   
 4   Comment          3829 non-null   object                   
 5   CommentLike      3850 non-null   int64                    
 6   CommentReply     2664 non-null   Int64                    
 7   UserID           3850 non-null   int64                    
 8   UserName         3850 non-null   object                   
 9   Province         3850 non-null   object                   
 10  Region           1694 non-null   object                   
 11  UserLocation     3850 non-null   object                 
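
The split can also be computed once and reused for both columns. A minimal sketch on a few location strings taken from the output above (`loc` and `parts` are illustrative names):

```python
import pandas as pd

loc = pd.Series(["其他", "甘肃 庆阳", "上海 松江区", "海外 爱沙尼亚"])

# Split on the first space only; rows without a space get a missing Region.
parts = loc.str.split(" ", n=1, expand=True)
parts.columns = ["Province", "Region"]
print(parts)
```

This assumes at least one value contains a space; otherwise `expand=True` returns a single column and the renaming would fail.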

In [25]:
data['Province'].value_counts()

Province
其他     1062
广东      315
江苏      276
海外      252
上海      234
北京      202
浙江      191
四川      187
山东      106
湖北       90
福建       87
湖南       77
河南       75
重庆       68
辽宁       64
安徽       64
广西       61
天津       48
河北       48
吉林       47
江西       47
山西       38
陕西       37
云南       36
黑龙江      34
内蒙古      24
新疆       15
贵州       14
香港       13
海南       12
甘肃        9
宁夏        7
青海        5
澳门        4
西藏        1
Name: count, dtype: int64

Most users did not fill in a Region (2156 of the 3850 records are null), so this variable is set aside and not processed further.

In [26]:
data['Region'].isnull().sum()

2156

Use LabelEncoder to encode Province. The order of the list printed below is the order of the labels: the indices run from 0 to 34.

In [27]:
le_province=LabelEncoder()
province_cate=le_province.fit_transform(data['Province'])
print(le_province.classes_)
print('\n','Total number of Provinces: ', len(le_province.classes_))

['上海' '云南' '其他' '内蒙古' '北京' '吉林' '四川' '天津' '宁夏' '安徽' '山东' '山西' '广东' '广西'
 '新疆' '江苏' '江西' '河北' '河南' '浙江' '海南' '海外' '湖北' '湖南' '澳门' '甘肃' '福建' '西藏'
 '贵州' '辽宁' '重庆' '陕西' '青海' '香港' '黑龙江']

 Total number of Provinces:  35
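
To look up which province a code stands for, a code-to-name mapping is handy. The sketch below uses pandas categorical codes rather than LabelEncoder. For plain string labels both should assign codes in sorted (Unicode code point) order, but treat that equivalence as an assumption and verify it against `le_province.classes_`.

```python
import pandas as pd

provinces = pd.Series(["其他", "上海", "北京", "其他"])

# Categories are the sorted unique labels; codes are their positions.
as_category = provinces.astype("category")
codes = as_category.cat.codes

# Mapping from integer code back to the province name.
code_to_name = dict(enumerate(as_category.cat.categories))
print(codes.tolist(), code_to_name)
```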


In [28]:
data.insert(10,'ProvinceCode',province_cate)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   CommentID        3850 non-null   int64                    
 1   CommentTime      3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID           3850 non-null   int64                    
 3   CommentRaw       3850 non-null   object                   
 4   Comment          3829 non-null   object                   
 5   CommentLike      3850 non-null   int64                    
 6   CommentReply     2664 non-null   Int64                    
 7   UserID           3850 non-null   int64                    
 8   UserName         3850 non-null   object                   
 9   Province         3850 non-null   object                   
 10  ProvinceCode     3850 non-null   int32                    
 11  Region           1694 non-null   object                 

In [29]:
data['ProvinceCode']

0        2
1        2
2       25
3        2
4        2
        ..
3845     2
3846     2
3847    31
3848     0
3849     0
Name: ProvinceCode, Length: 3850, dtype: int32

## 1.5 UserGender: map the original values to binary codes

In [30]:
data['UserGender'].value_counts()

UserGender
m    2943
f     907
Name: count, dtype: int64

Map the values: m (male) -> 1; f (female) -> 0

In [31]:
gender_code={'m':1,'f':0}
data['UserGender']=data['UserGender'].map(gender_code)
data['UserGender'].value_counts()

UserGender
1    2943
0     907
Name: count, dtype: int64

## 1.6 UserVerified: map the original values to binary codes

False means an unverified user and is mapped to 0; True means a verified user and is mapped to 1.

In [32]:
data['UserVerified'].value_counts()

UserVerified
False    3685
True      165
Name: count, dtype: int64

In [33]:
data['UserVerified'].astype(int).value_counts()

UserVerified
0    3685
1     165
Name: count, dtype: int64

Make sure the data type is int.

In [34]:
data['UserVerified']=data['UserVerified'].astype(int)
data['UserVerified']

0       0
1       0
2       0
3       0
4       0
       ..
3845    0
3846    0
3847    0
3848    0
3849    0
Name: UserVerified, Length: 3850, dtype: int32

## 1.7 UserFanTitle: rename and map

Most users are not loyal fans, and this variable takes only two values ("loyal_fans" and "Null"). Therefore, rename UserFanTitle to LoyalFan and then map the values: loyal_fans -> 1, Null -> 0

In [35]:
data['UserFanTitle'].value_counts()

UserFanTitle
Null          3563
loyal_fans     287
Name: count, dtype: int64

In [36]:
data.rename(columns={'UserFanTitle': 'LoyalFan'}, inplace=True)
fan_code={'loyal_fans':1,'Null':0}
data['LoyalFan']=data['LoyalFan'].map(fan_code)
data['LoyalFan'].value_counts()

LoyalFan
0    3563
1     287
Name: count, dtype: int64

## 1.8 Overview of the preliminary processing result

Explanation of the 4 variables with null values:
- Comment: as expected. This feature is the plain-text comment after removing emojis and emoticons, whereas CommentRaw keeps the full original text. We use CommentRaw in the following analysis (with simple text processing where necessary).
- CommentReply: as expected. Null values mark the "replies" discussed above; the corresponding RootID values are all non-zero.
- Region: discussed in Section 1.4.
- UserDescription: as expected. A null value indicates that the user has not written a description/bio in their profile.

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   CommentID        3850 non-null   int64                    
 1   CommentTime      3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID           3850 non-null   int64                    
 3   CommentRaw       3850 non-null   object                   
 4   Comment          3829 non-null   object                   
 5   CommentLike      3850 non-null   int64                    
 6   CommentReply     2664 non-null   Int64                    
 7   UserID           3850 non-null   int64                    
 8   UserName         3850 non-null   object                   
 9   Province         3850 non-null   object                   
 10  ProvinceCode     3850 non-null   int32                    
 11  Region           1694 non-null   object                 

# 2. User profile and database

User information is extracted and stored in a new dataframe. Besides the profile information, new columns describing user behavior under the target post (e.g., the number of comments and of received `likes`) are added.<br>
Basic feature engineering is also performed on UserDescription.<br>
All data processing steps are shown in Sections 2.1 ~ 2.7. <span style="color:red">If you want to read the processed data directly, turn to Section 2.8.</span>

## 2.1 Split out the user database

In [38]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   CommentID        3850 non-null   int64                    
 1   CommentTime      3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID           3850 non-null   int64                    
 3   CommentRaw       3850 non-null   object                   
 4   Comment          3829 non-null   object                   
 5   CommentLike      3850 non-null   int64                    
 6   CommentReply     2664 non-null   Int64                    
 7   UserID           3850 non-null   int64                    
 8   UserName         3850 non-null   object                   
 9   Province         3850 non-null   object                   
 10  ProvinceCode     3850 non-null   int32                    
 11  Region           1694 non-null   object                 

Extract the dataframe indices in `data` that belong to each user (identified by UserID).

In [39]:
userID_index=data.groupby('UserID').groups
user_data=pd.DataFrame(columns=['UserID', 'FirstIndex'])
for user_id, indices in userID_index.items():
    new_row = pd.DataFrame({'UserID': [user_id], 'FirstIndex': [indices[0]]})
    user_data=pd.concat([user_data, new_row], ignore_index=True)
user_data

Unnamed: 0,UserID,FirstIndex
0,1001914040,3498
1,1008309912,43
2,1025900974,1394
3,1028179843,2564
4,1035744261,1292
...,...,...
2798,7755717663,16
2799,7766444420,945
2800,7772408887,10
2801,7774567481,2188
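
Growing a dataframe with `pd.concat` inside a loop is slow, and starting from an empty frame is the likely reason why UserID and FirstIndex show up as `object` dtype in later `info()` outputs. A vectorized sketch of the same idea, on made-up data, keeps the first row index per user with `drop_duplicates`. The names `toy` and `first` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "UserID": [30, 10, 30, 20, 10],
    "UserName": ["c", "a", "c", "b", "a"],
})

first = (
    toy.reset_index()                         # expose the row index as a column
       .rename(columns={"index": "FirstIndex"})
       .drop_duplicates("UserID")             # keeps the first occurrence per user
       .sort_values("UserID")                 # same ordering as groupby('UserID')
       .reset_index(drop=True)
)
print(first[["UserID", "FirstIndex"]])
```

With this approach the integer dtypes are preserved and the profile columns come along without a separate merge.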


Obtain the other user features through `merge`.<br>
Remove the columns unrelated to user information (column positions 2~8).

In [40]:
user_info=data.loc[list(user_data['FirstIndex'])]
user_data=pd.merge(user_data,user_info,on='UserID')
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   UserID           2803 non-null   object                   
 1   FirstIndex       2803 non-null   object                   
 2   CommentID        2803 non-null   int64                    
 3   CommentTime      2803 non-null   datetime64[ns, UTC+08:00]
 4   RootID           2803 non-null   int64                    
 5   CommentRaw       2803 non-null   object                   
 6   Comment          2789 non-null   object                   
 7   CommentLike      2803 non-null   int64                    
 8   CommentReply     2390 non-null   Int64                    
 9   UserName         2803 non-null   object                   
 10  Province         2803 non-null   object                   
 11  ProvinceCode     2803 non-null   int32                  

In [41]:
col_to_drop=list(user_data.columns[2:9])
col_to_drop

['CommentID',
 'CommentTime',
 'RootID',
 'CommentRaw',
 'Comment',
 'CommentLike',
 'CommentReply']

In [42]:
user_data=user_data.drop(col_to_drop, axis=1, inplace=False)
user_data

Unnamed: 0,UserID,FirstIndex,UserName,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,薪火鹏,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0
1,1008309912,43,提尔乌斯,上海,0,,上海,此处无人,1,0,0,0,0,0,0
2,1025900974,1394,猫的摇篮-伪物,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0
3,1028179843,2564,非常神奇的老z,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0
4,1035744261,1292,假装很强的萌新,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,其他,2,,其他,,1,0,10,2,0,0,0
2799,7766444420,945,烛虚cron,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0
2800,7772408887,10,bo_白色大月亮,其他,2,,其他,,1,0,50,6,0,0,0
2801,7774567481,2188,你好陈博,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0


## 2.2 Calculate the number of comments and replies from each user

Count the comments and replies of each user:
- RootID = 0: Comment
- RootID ≠ 0: Reply

Counting the occurrences is enough; the counts are then matched back to the user table by UserID through `merge`.
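
The rule above can also be applied in a single pass. The following sketch, on a small made-up frame, labels each record as Comment or Reply and tabulates both counts per user with `pd.crosstab`. It is an alternative illustration, not the procedure used in the cells below.

```python
import pandas as pd

toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3, 3],
    "RootID": [0, 0, 55, 0, 77, 77],
})

# RootID == 0 marks a top-level comment; anything else is a reply.
kind = toy["RootID"].eq(0).map({True: "Comment", False: "Reply"})

# One row per user, one column per record type; absent combinations become 0.
counts = pd.crosstab(toy["UserID"], kind).reindex(
    columns=["Comment", "Reply"], fill_value=0
)
counts["TotalComment"] = counts["Comment"] + counts["Reply"]
print(counts)
```

Because missing combinations are filled with 0 directly, no null handling or float-to-int conversion is needed afterwards.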

In [43]:
# Comments (RootID == 0)
comment_data=data[data['RootID']==0]
comment_data=comment_data.groupby('UserID').count().reset_index().iloc[:,:2]
comment_data

Unnamed: 0,UserID,CommentID
0,1001914040,1
1,1008309912,4
2,1025900974,2
3,1035744261,1
4,1036072925,1
...,...,...
2393,7732193786,1
2394,7752387333,1
2395,7752939212,1
2396,7755717663,1


In [44]:
user_data=pd.merge(user_data,comment_data,how='left',on='UserID')
user_data

Unnamed: 0,UserID,FirstIndex,UserName,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,CommentID
0,1001914040,3498,薪火鹏,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0,1.0
1,1008309912,43,提尔乌斯,上海,0,,上海,此处无人,1,0,0,0,0,0,0,4.0
2,1025900974,1394,猫的摇篮-伪物,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0,2.0
3,1028179843,2564,非常神奇的老z,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0,
4,1035744261,1292,假装很强的萌新,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,其他,2,,其他,,1,0,10,2,0,0,0,1.0
2799,7766444420,945,烛虚cron,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0,
2800,7772408887,10,bo_白色大月亮,其他,2,,其他,,1,0,50,6,0,0,0,
2801,7774567481,2188,你好陈博,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0,


In [45]:
# Replies (RootID != 0)
reply_data=data[data['RootID']!=0]
reply_data=reply_data.groupby('UserID').count().reset_index().iloc[:,:2]
reply_data

Unnamed: 0,UserID,CommentID
0,1028179843,1
1,1039477413,1
2,1046537351,2
3,1078938744,1
4,1174230992,3
...,...,...
540,7747956676,1
541,7752842167,1
542,7766444420,1
543,7772408887,4


In [46]:
user_data=pd.merge(user_data,reply_data,how='left',on='UserID')
user_data

Unnamed: 0,UserID,FirstIndex,UserName,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,CommentID_x,CommentID_y
0,1001914040,3498,薪火鹏,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0,1.0,
1,1008309912,43,提尔乌斯,上海,0,,上海,此处无人,1,0,0,0,0,0,0,4.0,
2,1025900974,1394,猫的摇篮-伪物,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0,2.0,
3,1028179843,2564,非常神奇的老z,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0,,1.0
4,1035744261,1292,假装很强的萌新,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,其他,2,,其他,,1,0,10,2,0,0,0,1.0,
2799,7766444420,945,烛虚cron,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0,,1.0
2800,7772408887,10,bo_白色大月亮,其他,2,,其他,,1,0,50,6,0,0,0,,4.0
2801,7774567481,2188,你好陈博,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0,,1.0


In [47]:
# Rename the two count columns
user_data.rename(columns={'CommentID_x': 'Comment', 'CommentID_y': 'Reply'}, inplace=True)

In [48]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   UserID           2803 non-null   object 
 1   FirstIndex       2803 non-null   object 
 2   UserName         2803 non-null   object 
 3   Province         2803 non-null   object 
 4   ProvinceCode     2803 non-null   int32  
 5   Region           1277 non-null   object 
 6   UserLocation     2803 non-null   object 
 7   UserDescription  1735 non-null   object 
 8   UserGender       2803 non-null   int64  
 9   UserFan          2803 non-null   int64  
 10  UserFollow       2803 non-null   int64  
 11  UserWeibo        2803 non-null   int64  
 12  UserVerified     2803 non-null   int32  
 13  LoyalFan         2803 non-null   int64  
 14  VipRank          2803 non-null   int64  
 15  Comment          2398 non-null   float64
 16  Reply            545 non-null    float64
dtypes: float64(2),

Replace the null values in Comment and Reply with 0.
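
A compact way to do the fill and the integer conversion together is `fillna(0)` followed by `astype(int)`. A minimal sketch on made-up counts:

```python
import pandas as pd

toy = pd.DataFrame({
    "UserID": [1, 2, 3],
    "Comment": [1.0, None, 2.0],
    "Reply": [None, 4.0, None],
})

# Missing counts mean "no such record", so 0 is the natural fill value.
toy[["Comment", "Reply"]] = toy[["Comment", "Reply"]].fillna(0).astype(int)
print(toy)
```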

In [49]:
com_null_idx=user_data[user_data['Comment'].isnull()].index
com_null_idx

Index([   3,    6,   11,   28,   46,   51,   70,   79,  100,  115,
       ...
       2782, 2784, 2785, 2786, 2791, 2794, 2796, 2799, 2800, 2801],
      dtype='int64', length=405)

In [50]:
user_data.loc[com_null_idx,'Comment']=0
user_data

Unnamed: 0,UserID,FirstIndex,UserName,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Comment,Reply
0,1001914040,3498,薪火鹏,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0,1.0,
1,1008309912,43,提尔乌斯,上海,0,,上海,此处无人,1,0,0,0,0,0,0,4.0,
2,1025900974,1394,猫的摇篮-伪物,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0,2.0,
3,1028179843,2564,非常神奇的老z,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0,0.0,1.0
4,1035744261,1292,假装很强的萌新,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,其他,2,,其他,,1,0,10,2,0,0,0,1.0,
2799,7766444420,945,烛虚cron,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0,0.0,1.0
2800,7772408887,10,bo_白色大月亮,其他,2,,其他,,1,0,50,6,0,0,0,0.0,4.0
2801,7774567481,2188,你好陈博,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0,0.0,1.0


In [51]:
rpl_null_idx=user_data[user_data['Reply'].isnull()].index
user_data.loc[rpl_null_idx,'Reply']=0
user_data

Unnamed: 0,UserID,FirstIndex,UserName,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Comment,Reply
0,1001914040,3498,薪火鹏,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0,1.0,0.0
1,1008309912,43,提尔乌斯,上海,0,,上海,此处无人,1,0,0,0,0,0,0,4.0,0.0
2,1025900974,1394,猫的摇篮-伪物,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0,2.0,0.0
3,1028179843,2564,非常神奇的老z,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0,0.0,1.0
4,1035744261,1292,假装很强的萌新,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,其他,2,,其他,,1,0,10,2,0,0,0,1.0,0.0
2799,7766444420,945,烛虚cron,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0,0.0,1.0
2800,7772408887,10,bo_白色大月亮,其他,2,,其他,,1,0,50,6,0,0,0,0.0,4.0
2801,7774567481,2188,你好陈博,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0,0.0,1.0


In [52]:
# Convert the data type to int
user_data[['Comment','Reply']]=user_data[['Comment','Reply']].astype(int)
user_data

Unnamed: 0,UserID,FirstIndex,UserName,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Comment,Reply
0,1001914040,3498,薪火鹏,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0,1,0
1,1008309912,43,提尔乌斯,上海,0,,上海,此处无人,1,0,0,0,0,0,0,4,0
2,1025900974,1394,猫的摇篮-伪物,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0,2,0
3,1028179843,2564,非常神奇的老z,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0,0,1
4,1035744261,1292,假装很强的萌新,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,其他,2,,其他,,1,0,10,2,0,0,0,1,0
2799,7766444420,945,烛虚cron,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0,0,1
2800,7772408887,10,bo_白色大月亮,其他,2,,其他,,1,0,50,6,0,0,0,0,4
2801,7774567481,2188,你好陈博,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0,0,1


TotalComment: the number of texts posted by each user (i.e., the sum of Comment and Reply).

In [53]:
user_data['TotalComment']=user_data['Comment']+user_data['Reply']
user_data

Unnamed: 0,UserID,FirstIndex,UserName,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Comment,Reply,TotalComment
0,1001914040,3498,薪火鹏,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0,1,0,1
1,1008309912,43,提尔乌斯,上海,0,,上海,此处无人,1,0,0,0,0,0,0,4,0,4
2,1025900974,1394,猫的摇篮-伪物,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0,2,0,2
3,1028179843,2564,非常神奇的老z,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0,0,1,1
4,1035744261,1292,假装很强的萌新,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,其他,2,,其他,,1,0,10,2,0,0,0,1,0,1
2799,7766444420,945,烛虚cron,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0,0,1,1
2800,7772408887,10,bo_白色大月亮,其他,2,,其他,,1,0,50,6,0,0,0,0,4,4
2801,7774567481,2188,你好陈博,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0,0,1,1


Double-check that the per-user totals add up to the total number of records (3850).

In [54]:
user_data['TotalComment'].sum()

3850

Change the column order to make the dataframe more readable.

In [55]:
tc=user_data['TotalComment']
c=user_data['Comment']
r=user_data['Reply']
col_drop=['TotalComment','Comment','Reply']
user_data.drop(col_drop, axis=1, inplace=True)
user_data.insert(3,'TotalComment',tc)
user_data.insert(4,'Comment',c)
user_data.insert(5,'Reply',r)
user_data

Unnamed: 0,UserID,FirstIndex,UserName,TotalComment,Comment,Reply,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,薪火鹏,1,1,0,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0
1,1008309912,43,提尔乌斯,4,4,0,上海,0,,上海,此处无人,1,0,0,0,0,0,0
2,1025900974,1394,猫的摇篮-伪物,2,2,0,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0
3,1028179843,2564,非常神奇的老z,1,0,1,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0
4,1035744261,1292,假装很强的萌新,1,1,0,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,寂月海200007,1,1,0,其他,2,,其他,,1,0,10,2,0,0,0
2799,7766444420,945,烛虚cron,1,0,1,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0
2800,7772408887,10,bo_白色大月亮,4,0,4,其他,2,,其他,,1,0,50,6,0,0,0
2801,7774567481,2188,你好陈博,1,0,1,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0
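As an alternative to dropping and re-inserting columns, the new order can be built as a list and applied in a single indexing step. This is only a sketch on a toy frame; the helper names `moved`, `rest`, and `new_order` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "UserID": ["a", "b"],
    "UserName": ["x", "y"],
    "Province": ["p", "q"],
    "Comment": [1, 0],
    "Reply": [0, 2],
    "TotalComment": [1, 2],
})

# Columns to place directly after UserName, in the desired order.
moved = ["TotalComment", "Comment", "Reply"]
rest = [c for c in toy.columns if c not in moved]
anchor = rest.index("UserName") + 1
new_order = rest[:anchor] + moved + rest[anchor:]

# Indexing with a list of names returns a new frame in that column order.
reordered = toy[new_order]
```

Because this returns a new frame rather than mutating in place, the original stays available if the order needs to be revisited.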


## 2.3 Save the original DataFrame indices of each user's records

For each user, store the indices (in the original dataset) of all texts that user posted as a new column, `IndexList`.

In [56]:
user_idx=[]
for i in userID_index.values():
    user_idx.append(list(i))
user_idx=pd.Series(user_idx)
user_idx

0                     [3498]
1       [43, 149, 565, 1344]
2               [1394, 1686]
3                     [2564]
4                     [1292]
                ...         
2798                    [16]
2799                   [945]
2800      [10, 22, 158, 946]
2801                  [2188]
2802                     [9]
Length: 2803, dtype: object
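`userID_index` is built earlier in the notebook. As a hedged illustration of the idea, the sketch below derives an equivalent per-user list of row indices directly from a toy text-level table with `groupby`; the toy data and the name `index_lists` are assumptions, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for `data`: the index plays the role of the original row index.
toy_data = pd.DataFrame(
    {"UserID": ["u2", "u1", "u2", "u3", "u1"]},
    index=[10, 22, 158, 43, 946],
)

# Turn the row labels into a Series, group them by UserID,
# and collect each group's labels into a Python list.
index_lists = toy_data.index.to_series().groupby(toy_data["UserID"]).agg(list)
```

The result is a Series keyed by `UserID`, so it can be aligned with a user table by key rather than by position.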

In [57]:
user_data.insert(2,'IndexList',user_idx)
user_data

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,[3498],薪火鹏,1,1,0,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,上海,0,,上海,此处无人,1,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,[16],寂月海200007,1,1,0,其他,2,,其他,,1,0,10,2,0,0,0
2799,7766444420,945,[945],烛虚cron,1,0,1,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0
2800,7772408887,10,"[10, 22, 158, 946]",bo_白色大月亮,4,0,4,其他,2,,其他,,1,0,50,6,0,0,0
2801,7774567481,2188,[2188],你好陈博,1,0,1,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0


## 2.4 The number of likes received by each user
Count the total number of likes received by each user's texts/posts (i.e., both Comment and Reply).

In [58]:
user_likeCount = data.groupby('UserID')['CommentLike'].sum().reset_index()
user_likeCount

Unnamed: 0,UserID,CommentLike
0,1001914040,0
1,1008309912,0
2,1025900974,1
3,1028179843,0
4,1035744261,0
...,...,...
2798,7755717663,3
2799,7766444420,0
2800,7772408887,0
2801,7774567481,2


In [59]:
user_data = pd.merge(user_data,user_likeCount, how='left', on='UserID')
user_data

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,CommentLike
0,1001914040,3498,[3498],薪火鹏,1,1,0,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,上海,0,,上海,此处无人,1,0,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0,1
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2798,7755717663,16,[16],寂月海200007,1,1,0,其他,2,,其他,,1,0,10,2,0,0,0,3
2799,7766444420,945,[945],烛虚cron,1,0,1,湖北,22,武汉,湖北 武汉,您诸位好哇(〜￣▽￣)〜,0,1,38,37,0,0,0,0
2800,7772408887,10,"[10, 22, 158, 946]",bo_白色大月亮,4,0,4,其他,2,,其他,,1,0,50,6,0,0,0,0
2801,7774567481,2188,[2188],你好陈博,1,0,1,安徽,9,宿州,安徽 宿州,,1,0,7,0,0,0,0,2
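The aggregate-then-merge pattern above can be made a little more defensive. The sketch below, on made-up data, adds `validate="one_to_one"` so the merge raises if `UserID` is duplicated on either side, and fills the NaN that a left join produces for users without any text (which should not occur in the notebook's data, since every user has at least one text).

```python
import pandas as pd

toy_users = pd.DataFrame({"UserID": ["a", "b", "c"]})
toy_texts = pd.DataFrame({
    "UserID": ["a", "a", "c"],
    "CommentLike": [2, 3, 0],
})

# Sum the likes of all texts per user.
like_count = toy_texts.groupby("UserID")["CommentLike"].sum().reset_index()

# how="left" keeps every user; validate="one_to_one" raises MergeError
# if UserID is not unique in either table.
merged = pd.merge(toy_users, like_count, how="left", on="UserID",
                  validate="one_to_one")

# Users with no text get NaN from the left join; treat that as zero likes.
merged["CommentLike"] = merged["CommentLike"].fillna(0).astype(int)
```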


In [60]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   object
 1   FirstIndex       2803 non-null   object
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int32 
 5   Comment          2803 non-null   int32 
 6   Reply            2803 non-null   int32 
 7   Province         2803 non-null   object
 8   ProvinceCode     2803 non-null   int32 
 9   Region           1277 non-null   object
 10  UserLocation     2803 non-null   object
 11  UserDescription  1735 non-null   object
 12  UserGender       2803 non-null   int64 
 13  UserFan          2803 non-null   int64 
 14  UserFollow       2803 non-null   int64 
 15  UserWeibo        2803 non-null   int64 
 16  UserVerified     2803 non-null   int32 
 17  LoyalFan         2803 non-null   

Reorder the columns (renaming `CommentLike` to `LikeCount` along the way) to make the DataFrame more readable.

In [61]:
cl = user_data['CommentLike']
user_data.drop('CommentLike',axis=1,inplace=True)
user_data.insert(7, 'LikeCount', cl)
user_data.head()

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,[3498],薪火鹏,1,1,0,0,广东,12,,广东,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,0,上海,0,,上海,此处无人,1,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,1,天津,7,南开区,天津 南开区,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,0,上海,0,静安区,上海 静安区,,1,13,59,9,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,0,广东,12,,广东,一个喜欢二次元的萌新,1,14,34,107,0,1,0


In [62]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   object
 1   FirstIndex       2803 non-null   object
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int32 
 5   Comment          2803 non-null   int32 
 6   Reply            2803 non-null   int32 
 7   LikeCount        2803 non-null   int64 
 8   Province         2803 non-null   object
 9   ProvinceCode     2803 non-null   int32 
 10  Region           1277 non-null   object
 11  UserLocation     2803 non-null   object
 12  UserDescription  1735 non-null   object
 13  UserGender       2803 non-null   int64 
 14  UserFan          2803 non-null   int64 
 15  UserFollow       2803 non-null   int64 
 16  UserWeibo        2803 non-null   int64 
 17  UserVerified     2803 non-null   

## 2.5 Simplify the UserDescription

Add a new binary column, `Description`, indicating whether the user has a bio/description in their profile:<br>
Description = 1: has a description<br>
Description = 0: has no description

In [63]:
null_descrIdx=user_data[user_data['UserDescription'].isnull()].index
user_data.insert(12, "Description",1)
user_data.head()

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,UserLocation,Description,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,[3498],薪火鹏,1,1,0,0,广东,12,...,广东,1,这个世界并没有错，只是存在于那里而已。,1,30,56,692,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,0,上海,0,...,上海,1,此处无人,1,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,1,天津,7,...,天津 南开区,1,请在我们脏的时候爱我们。,1,917,1109,2013,0,0,0
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,0,上海,0,...,上海 静安区,1,,1,13,59,9,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,0,广东,12,...,广东,1,一个喜欢二次元的萌新,1,14,34,107,0,1,0


In [64]:
user_data.loc[null_descrIdx,'Description']=0
user_data['Description'].value_counts()

Description
1    1735
0    1068
Name: count, dtype: int64
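The same 0/1 flag can also be derived in one step from the null mask. This is a minimal sketch on a toy column, not a replacement for the cells above.

```python
import pandas as pd

toy = pd.DataFrame({"UserDescription": ["此处无人", None, "银行🐶"]})

# notna() yields True/False per row; casting to int gives the 1/0 flag.
toy["Description"] = toy["UserDescription"].notna().astype(int)
```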

In [65]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   object
 1   FirstIndex       2803 non-null   object
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int32 
 5   Comment          2803 non-null   int32 
 6   Reply            2803 non-null   int32 
 7   LikeCount        2803 non-null   int64 
 8   Province         2803 non-null   object
 9   ProvinceCode     2803 non-null   int32 
 10  Region           1277 non-null   object
 11  UserLocation     2803 non-null   object
 12  Description      2803 non-null   int64 
 13  UserDescription  1735 non-null   object
 14  UserGender       2803 non-null   int64 
 15  UserFan          2803 non-null   int64 
 16  UserFollow       2803 non-null   int64 
 17  UserWeibo        2803 non-null   

## 2.6 Feature engineering: number of special characters in UserDescription

In [66]:
# Regex matching special characters (anything other than Chinese characters, ASCII letters, digits, and whitespace)
special_char_pattern = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9\s]')

# Count the special characters in each description with pandas' str.count()
user_data.insert(14,'SpecialChar',user_data['UserDescription'].str.count(special_char_pattern))

user_data.head(10)

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,Description,UserDescription,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,[3498],薪火鹏,1,1,0,0,广东,12,...,1,这个世界并没有错，只是存在于那里而已。,2.0,1,30,56,692,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,0,上海,0,...,1,此处无人,0.0,1,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,1,天津,7,...,1,请在我们脏的时候爱我们。,1.0,1,917,1109,2013,0,0,0
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,0,上海,0,...,0,,,1,13,59,9,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,0,广东,12,...,1,一个喜欢二次元的萌新,0.0,1,14,34,107,0,1,0
5,1036072925,1628,[1628],空空今天养乐多了吗,1,1,0,0,北京,4,...,1,佛系solo甜唯,0.0,0,403,1436,7101,1,1,6
6,1039477413,1442,[1442],阿卡牟-akamoo,1,0,1,0,福建,26,...,1,FF14 dota2 奥妙鸡,0.0,1,36,146,30,0,0,0
7,1046537351,551,"[551, 3287, 3521]",Y口十子,3,1,2,0,云南,1,...,1,RAmen,0.0,1,1145,377,4991,0,0,0
8,1066324192,702,[702],斯蒂芬徐,1,1,0,0,广东,12,...,1,银行🐶,1.0,1,352,1344,4089,0,0,0
9,1069108481,2827,[2827],FAITHLESSS,1,1,0,0,湖北,22,...,0,,,1,199,290,553,0,0,0
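To see what the pattern treats as a special character, it helps to apply it to single strings. The sketch below uses only the standard `re` module on three descriptions taken from the table above; `count_special` is a hypothetical helper name.

```python
import re

# Same pattern as in the notebook: anything that is not a Chinese character,
# an ASCII letter, a digit, or whitespace counts as "special".
special_char_pattern = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9\s]')

def count_special(text):
    """Count the characters matched by the special-character pattern."""
    return len(special_char_pattern.findall(text))

samples = ["这个世界并没有错，只是存在于那里而已。", "FF14 dota2 奥妙鸡", "银行🐶"]
counts = [count_special(s) for s in samples]
# counts == [2, 0, 1]: two full-width punctuation marks, nothing, one emoji
```

Full-width punctuation and emoji both fall outside the allowed ranges, which is why this feature mixes the two.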


Replace NaN (users without a description) with 0.

In [67]:
sc_nullIdx = user_data[user_data['SpecialChar'].isnull()].index
user_data.loc[sc_nullIdx, 'SpecialChar'] = 0
user_data[user_data['SpecialChar'].isnull()]

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,Description,UserDescription,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
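The NaN replacement and the integer cast can also be chained with `fillna`. The toy Series below is illustrative only.

```python
import pandas as pd

toy = pd.Series([2.0, None, 0.0, None, 1.0], name="SpecialChar")

# fillna(0) replaces the missing counts; astype(int) restores an integer dtype.
filled = toy.fillna(0).astype(int)
```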


In [68]:
user_data['SpecialChar']=user_data['SpecialChar'].astype(int)
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   object
 1   FirstIndex       2803 non-null   object
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int32 
 5   Comment          2803 non-null   int32 
 6   Reply            2803 non-null   int32 
 7   LikeCount        2803 non-null   int64 
 8   Province         2803 non-null   object
 9   ProvinceCode     2803 non-null   int32 
 10  Region           1277 non-null   object
 11  UserLocation     2803 non-null   object
 12  Description      2803 non-null   int64 
 13  UserDescription  1735 non-null   object
 14  SpecialChar      2803 non-null   int32 
 15  UserGender       2803 non-null   int64 
 16  UserFan          2803 non-null   int64 
 17  UserFollow       2803 non-null   

## 2.7 Feature engineering: emoji in UserDescription and the length of the description

Since many users include emoji/emoticons and punctuation in UserDescription, we:
- First, remove all special characters (including emoji, emoticons, and punctuation) to get the cleaned text: DescriClean
- Second, convert all emoji in the original text UserDescription to their corresponding Chinese emoji codes (in the form `:code:`)
- Third, remove the spaces in DescriClean and compute its length (i.e., the number of single characters)
- Fourth, for a description containing emoji, update the length to: length of DescriClean + number of emoji

The terms "emoji" and "emoticon" are used interchangeably here.
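The four steps can be traced on a single description. The sketch below uses only the standard `re` module and assumes the demojizing step has already been done: the demojized string `银行:狗脸:` is the one that appears in this notebook's output for the description `银行🐶`. The function name `description_length` is hypothetical.

```python
import re

special_char_pattern = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9\s]')
emoji_code_pattern = re.compile(r':.*?:')

def description_length(raw_text, demojized_text):
    """Return (cleaned text, cleaned length without spaces + number of emoji codes)."""
    # Step 1: strip special characters (emoji and punctuation) -> DescriClean
    cleaned = special_char_pattern.sub('', raw_text)
    # Step 3: length of the cleaned text after removing spaces
    base_len = len(cleaned.replace(' ', ''))
    # Step 4: add the number of `:code:` patterns in the demojized text
    n_emoji = len(emoji_code_pattern.findall(demojized_text))
    return cleaned, base_len + n_emoji

# Step 2 (demojizing) is assumed to have produced the second argument.
cleaned, total_len = description_length("银行🐶", "银行:狗脸:")
# cleaned == "银行", total_len == 3 (two characters plus one emoji)
```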

In [69]:
# Remove special characters and emoji
user_data.insert(14,'DescriClean',user_data['UserDescription'].str.replace(special_char_pattern, '',regex=True))
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   object
 1   FirstIndex       2803 non-null   object
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int32 
 5   Comment          2803 non-null   int32 
 6   Reply            2803 non-null   int32 
 7   LikeCount        2803 non-null   int64 
 8   Province         2803 non-null   object
 9   ProvinceCode     2803 non-null   int32 
 10  Region           1277 non-null   object
 11  UserLocation     2803 non-null   object
 12  Description      2803 non-null   int64 
 13  UserDescription  1735 non-null   object
 14  DescriClean      1735 non-null   object
 15  SpecialChar      2803 non-null   int32 
 16  UserGender       2803 non-null   int64 
 17  UserFan          2803 non-null   

Decode the emoji into their Chinese text codes.

In [70]:
def decode_emoji(text):
    return emoji.demojize(text,language='zh')


descrIdx=user_data[user_data['Description']==1].index


user_data.loc[descrIdx,'UserDescription']=user_data.loc[descrIdx,'UserDescription'].apply(decode_emoji)
user_data.head(10)

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,UserDescription,DescriClean,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,[3498],薪火鹏,1,1,0,0,广东,12,...,这个世界并没有错，只是存在于那里而已。,这个世界并没有错只是存在于那里而已,2,1,30,56,692,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,0,上海,0,...,此处无人,此处无人,0,1,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,1,天津,7,...,请在我们脏的时候爱我们。,请在我们脏的时候爱我们,1,1,917,1109,2013,0,0,0
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,0,上海,0,...,,,0,1,13,59,9,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,0,广东,12,...,一个喜欢二次元的萌新,一个喜欢二次元的萌新,0,1,14,34,107,0,1,0
5,1036072925,1628,[1628],空空今天养乐多了吗,1,1,0,0,北京,4,...,佛系solo甜唯,佛系solo甜唯,0,0,403,1436,7101,1,1,6
6,1039477413,1442,[1442],阿卡牟-akamoo,1,0,1,0,福建,26,...,FF14 dota2 奥妙鸡,FF14 dota2 奥妙鸡,0,1,36,146,30,0,0,0
7,1046537351,551,"[551, 3287, 3521]",Y口十子,3,1,2,0,云南,1,...,RAmen,RAmen,0,1,1145,377,4991,0,0,0
8,1066324192,702,[702],斯蒂芬徐,1,1,0,0,广东,12,...,银行:狗脸:,银行,1,1,352,1344,4089,0,0,0
9,1069108481,2827,[2827],FAITHLESSS,1,1,0,0,湖北,22,...,,,0,1,199,290,553,0,0,0


Calculate the length of UserDescription: DescriptionLen
- First, initialize DescriptionLen to 0 for all entries

In [71]:
def len_text(text):
    return len(text.replace(" ", ""))

descrIdx=user_data[user_data['Description']==1].index
user_data.insert(15, "DescriptionLen",0)
user_data.head(10)

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,[3498],薪火鹏,1,1,0,0,广东,12,...,这个世界并没有错只是存在于那里而已,0,2,1,30,56,692,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,0,上海,0,...,此处无人,0,0,1,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,1,天津,7,...,请在我们脏的时候爱我们,0,1,1,917,1109,2013,0,0,0
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,0,上海,0,...,,0,0,1,13,59,9,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,0,广东,12,...,一个喜欢二次元的萌新,0,0,1,14,34,107,0,1,0
5,1036072925,1628,[1628],空空今天养乐多了吗,1,1,0,0,北京,4,...,佛系solo甜唯,0,0,0,403,1436,7101,1,1,6
6,1039477413,1442,[1442],阿卡牟-akamoo,1,0,1,0,福建,26,...,FF14 dota2 奥妙鸡,0,0,1,36,146,30,0,0,0
7,1046537351,551,"[551, 3287, 3521]",Y口十子,3,1,2,0,云南,1,...,RAmen,0,0,1,1145,377,4991,0,0,0
8,1066324192,702,[702],斯蒂芬徐,1,1,0,0,广东,12,...,银行,0,1,1,352,1344,4089,0,0,0
9,1069108481,2827,[2827],FAITHLESSS,1,1,0,0,湖北,22,...,,0,0,1,199,290,553,0,0,0


Remove the spaces and calculate the length.

In [72]:
user_data.loc[descrIdx,'DescriptionLen']=user_data.loc[descrIdx,'DescriClean'].apply(len_text)
user_data.head(10)

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,1001914040,3498,[3498],薪火鹏,1,1,0,0,广东,12,...,这个世界并没有错只是存在于那里而已,17,2,1,30,56,692,0,0,0
1,1008309912,43,"[43, 149, 565, 1344]",提尔乌斯,4,4,0,0,上海,0,...,此处无人,4,0,1,0,0,0,0,0,0
2,1025900974,1394,"[1394, 1686]",猫的摇篮-伪物,2,2,0,1,天津,7,...,请在我们脏的时候爱我们,11,1,1,917,1109,2013,0,0,0
3,1028179843,2564,[2564],非常神奇的老z,1,0,1,0,上海,0,...,,0,0,1,13,59,9,0,0,0
4,1035744261,1292,[1292],假装很强的萌新,1,1,0,0,广东,12,...,一个喜欢二次元的萌新,10,0,1,14,34,107,0,1,0
5,1036072925,1628,[1628],空空今天养乐多了吗,1,1,0,0,北京,4,...,佛系solo甜唯,8,0,0,403,1436,7101,1,1,6
6,1039477413,1442,[1442],阿卡牟-akamoo,1,0,1,0,福建,26,...,FF14 dota2 奥妙鸡,12,0,1,36,146,30,0,0,0
7,1046537351,551,"[551, 3287, 3521]",Y口十子,3,1,2,0,云南,1,...,RAmen,5,0,1,1145,377,4991,0,0,0
8,1066324192,702,[702],斯蒂芬徐,1,1,0,0,广东,12,...,银行,2,1,1,352,1344,4089,0,0,0
9,1069108481,2827,[2827],FAITHLESSS,1,1,0,0,湖北,22,...,,0,0,1,199,290,553,0,0,0


Count the emoji (represented as `:code:`) and update the description length to: length of DescriClean + number of emoji.

In [73]:
emoji_con = user_data['UserDescription'].str.count(re.compile(r':.*?:'))
emoji_idx = user_data[(emoji_con.notna()) & (emoji_con>0)].index
emoji_idx

Index([   8,   17,   24,   52,  146,  158,  177,  178,  179,  194,
       ...
       2343, 2367, 2411, 2463, 2504, 2596, 2613, 2656, 2689, 2785],
      dtype='int64', length=105)
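One caveat worth checking: `:.*?:` matches any text between two ASCII colons, not only emoji codes, so a description that already contains such colons (a time range, for instance) would be counted as containing an emoji. The sketch below shows the behavior on made-up strings; whether such descriptions actually occur among the 105 matches has not been verified here.

```python
import re

emoji_code_pattern = re.compile(r':.*?:')

examples = [
    "银行:狗脸:",            # demojized emoji code
    "营业时间 9:00-18:00",    # plain text containing two ASCII colons
    "微博：日常",             # full-width colon, which the pattern ignores
]
counts = [len(emoji_code_pattern.findall(text)) for text in examples]
# counts == [1, 1, 0]: the time range is a false positive
```

If this matters for the feature, spot-checking the matched rows is a cheap way to confirm that the counts reflect emoji only.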

In [74]:
user_data.loc[emoji_idx,'DescriptionLen'] = user_data.loc[emoji_idx,'DescriptionLen'] + emoji_con[emoji_idx]
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   object
 1   FirstIndex       2803 non-null   object
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int32 
 5   Comment          2803 non-null   int32 
 6   Reply            2803 non-null   int32 
 7   LikeCount        2803 non-null   int64 
 8   Province         2803 non-null   object
 9   ProvinceCode     2803 non-null   int32 
 10  Region           1277 non-null   object
 11  UserLocation     2803 non-null   object
 12  Description      2803 non-null   int64 
 13  UserDescription  1735 non-null   object
 14  DescriClean      1735 non-null   object
 15  DescriptionLen   2803 non-null   int64 
 16  SpecialChar      2803 non-null   int32 
 17  UserGender       2803 non-null   

The user database/profile is now built, and none of the key features contain null values.<br>
For convenience in the subsequent analysis, the necessary features are extracted and saved into a new DataFrame.

In [75]:
user_file='Datasets/UserData.csv'
user_data.to_csv(user_file, index=False,encoding='utf-8-sig')

In [76]:
user_profile = user_data.iloc[:,[0,2,3,4,5,6,7,9,12,15,16,17,18,19,20,21,22,23]]
user_profile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   UserID          2803 non-null   object
 1   IndexList       2803 non-null   object
 2   UserName        2803 non-null   object
 3   TotalComment    2803 non-null   int32 
 4   Comment         2803 non-null   int32 
 5   Reply           2803 non-null   int32 
 6   LikeCount       2803 non-null   int64 
 7   ProvinceCode    2803 non-null   int32 
 8   Description     2803 non-null   int64 
 9   DescriptionLen  2803 non-null   int64 
 10  SpecialChar     2803 non-null   int32 
 11  UserGender      2803 non-null   int64 
 12  UserFan         2803 non-null   int64 
 13  UserFollow      2803 non-null   int64 
 14  UserWeibo       2803 non-null   int64 
 15  UserVerified    2803 non-null   int32 
 16  LoyalFan        2803 non-null   int64 
 17  VipRank         2803 non-null   int64 
dtypes: int32

In [77]:
profile='Datasets/UserProfile.csv'
user_profile.to_csv(profile, index=False,encoding='utf-8-sig')

## 2.8 Read the processed datasets: `user_data` and `profile_data`

user_data: the full user database/profile after all the preprocessing in Sections 2.1~2.7;<br>
profile_data: the final user database/profile, containing only the necessary features extracted from user_data

In [78]:
user_data=pd.read_csv('Datasets/UserData.csv')
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   int64 
 1   FirstIndex       2803 non-null   int64 
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int64 
 5   Comment          2803 non-null   int64 
 6   Reply            2803 non-null   int64 
 7   LikeCount        2803 non-null   int64 
 8   Province         2803 non-null   object
 9   ProvinceCode     2803 non-null   int64 
 10  Region           1277 non-null   object
 11  UserLocation     2803 non-null   object
 12  Description      2803 non-null   int64 
 13  UserDescription  1735 non-null   object
 14  DescriClean      1654 non-null   object
 15  DescriptionLen   2803 non-null   int64 
 16  SpecialChar      2803 non-null   int64 
 17  UserGender       2803 non-null   

In [79]:
profile_data=pd.read_csv('Datasets/UserProfile.csv')
profile_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   UserID          2803 non-null   int64 
 1   IndexList       2803 non-null   object
 2   UserName        2803 non-null   object
 3   TotalComment    2803 non-null   int64 
 4   Comment         2803 non-null   int64 
 5   Reply           2803 non-null   int64 
 6   LikeCount       2803 non-null   int64 
 7   ProvinceCode    2803 non-null   int64 
 8   Description     2803 non-null   int64 
 9   DescriptionLen  2803 non-null   int64 
 10  SpecialChar     2803 non-null   int64 
 11  UserGender      2803 non-null   int64 
 12  UserFan         2803 non-null   int64 
 13  UserFollow      2803 non-null   int64 
 14  UserWeibo       2803 non-null   int64 
 15  UserVerified    2803 non-null   int64 
 16  LoyalFan        2803 non-null   int64 
 17  VipRank         2803 non-null   int64 
dtypes: int64
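The `info()` outputs show two side effects of the CSV round trip: `UserID` changes from object to int64, and `DescriClean` drops from 1735 to 1654 non-null values. The second is most likely because descriptions made up entirely of special characters become empty strings after cleaning, and pandas reads empty CSV fields as NaN by default. `IndexList` also comes back as a string rather than a list. The sketch below reproduces these effects on a toy frame and parses the lists back with `ast.literal_eval`; the toy values are illustrative.

```python
import ast
import io
import pandas as pd

toy = pd.DataFrame({
    "UserID": ["1008309912", "1066324192"],
    "IndexList": [[43, 149, 565, 1344], [702]],
    "DescriClean": ["此处无人", ""],  # empty string: nothing left after cleaning
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Lists are stored as their text representation; parse them back safely.
restored["IndexList"] = restored["IndexList"].apply(ast.literal_eval)

n_missing = int(restored["DescriClean"].isna().sum())  # the empty string became NaN
```

Passing `dtype={"UserID": str}` to `read_csv` is one way to keep the IDs as strings if that is preferred downstream.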

# 3. Dataset reorganization 1: preparing for the Heterogeneous Graph construction

The Heterogeneous Graph (HeteroG) proposed in this study consists of 2 graph levels: the user interaction network (`user-user`) and the user comment affiliation (`user-comment`).
- user-user: contains all the behaviors of users under the target post (i.e., leaving comments, replying to others/being replied to, and mentioning/@ other users)
- user-comment: indicates the correspondence/affiliation between a user and all the texts he/she left under the target post

Each row of the current dataset represents one text posted by a user. RootID can be used to identify whom a user is replying to, but information about mentions (@) is embedded in the text itself. Therefore, to make the subsequent HeteroG construction more convenient, we reorganize the dataset into a more readable form.

In [80]:
# Check the dataset
comment_data=data.copy()
comment_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   CommentID        3850 non-null   int64                    
 1   CommentTime      3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID           3850 non-null   int64                    
 3   CommentRaw       3850 non-null   object                   
 4   Comment          3829 non-null   object                   
 5   CommentLike      3850 non-null   int64                    
 6   CommentReply     2664 non-null   Int64                    
 7   UserID           3850 non-null   int64                    
 8   UserName         3850 non-null   object                   
 9   Province         3850 non-null   object                   
 10  ProvinceCode     3850 non-null   int32                    
 11  Region           1694 non-null   object                 

The crawled dataset is clearly missing some replies:
- CommentReply is the number of replies to a given text, as obtained by the Python script (summed over all texts that have replies, the total number of replies should be 1868)
- When crawling the existing replies, the script sets CommentReply = Null to mark a record as a reply. However, only 1186 records have a Null CommentReply, which does not equal the expected total of 1868

This indicates that, on Weibo, some comments/replies have been removed and can no longer be retrieved by the script.

In [81]:
comment_data[comment_data['CommentReply']>0]['CommentReply'].sum()

1868

In [82]:
len(comment_data[comment_data['CommentReply'].isnull()])

1186
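The comparison above can be condensed into a single difference. The sketch below uses a toy nullable `Int64` column in which a number marks a top-level comment and `<NA>` marks a crawled reply, following the convention described above; the values and variable names are made up.

```python
import pandas as pd

# Toy CommentReply column: a count for comments, <NA> for crawled replies.
comment_reply = pd.Series([2, 0, pd.NA, 1, pd.NA], dtype="Int64")

# Replies that the comments report having received.
expected_replies = int(comment_reply[comment_reply > 0].sum())
# Replies that the crawler actually retrieved.
crawled_replies = int(comment_reply.isna().sum())

missing_replies = expected_replies - crawled_replies
```

With the notebook's own numbers, 1868 - 1186 = 682 replies are unaccounted for.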

## 3.1 Re-organize the dataset and construct the user-user interaction dataset

To be comprehensive, the user-user dataset is designed to contain 3 types of information: From (the user who initiates the action), To (the user who receives the action), and Floor (the thread, or "floor", that the text belongs to). The simple example below illustrates this.

Simple example
- $U_A$: ...... ($C_A$)
  - $U_B$: ...... ($C_B$)
  - $U_C$: Reply@$U_B$: ...... ($C_C$)
  - $U_A$: Reply@$U_C$: ...... ($C_D$)

In this example, the From, To, and Floor of all 4 comments can be summarized as in the table below:

<table>
  <colgroup>
    <col style="width: 25%">
    <col style="width: 25%">
    <col style="width: 25%">
    <col style="width: 25%">
  </colgroup>
  <tr>
    <th>Index</th>
    <th>From</th>
    <th>To</th>
    <th>Floor</th>
  </tr>
  <tr>
    <td>$C_A$</td>
    <td>$U_A$</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>$C_B$</td>
    <td>$U_B$</td>
    <td>$U_A$</td>
    <td>$U_A$</td>
  </tr>
  <tr>
    <td>$C_C$</td>
    <td>$U_C$</td>
    <td>$U_B$</td>
    <td>$U_A$</td>
  </tr>
  <tr>
    <td>$C_D$</td>
    <td>$U_A$</td>
    <td>$U_C$</td>
    <td>$U_A$</td>
  </tr>
</table>
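The table can be reproduced with a few lines of plain Python. This is a minimal sketch under stated assumptions: the record fields (`id`, `user`, `root_id`, `text`), the use of `root_id = 0` for top-level comments, and the `Reply@name:` prefix format are illustrative choices, not the notebook's actual implementation.

```python
import re

# Toy records mirroring the example; top-level comments use root_id = 0.
records = [
    {"id": "C_A", "user": "U_A", "root_id": 0, "text": "......"},
    {"id": "C_B", "user": "U_B", "root_id": "C_A", "text": "......"},
    {"id": "C_C", "user": "U_C", "root_id": "C_A", "text": "Reply@U_B: ......"},
    {"id": "C_D", "user": "U_A", "root_id": "C_A", "text": "Reply@U_C: ......"},
]
by_id = {r["id"]: r for r in records}
reply_prefix = re.compile(r"^Reply@([^:]+):")  # assumed prefix format

def from_to_floor(record):
    """Return (From, To, Floor) user names; '-' means not applicable."""
    if record["root_id"] == 0:
        # A top-level comment has no recipient and opens its own floor.
        return record["user"], "-", "-"
    floor_user = by_id[record["root_id"]]["user"]
    match = reply_prefix.match(record["text"])
    # Without an explicit "Reply@" prefix, the reply targets the floor owner.
    to_user = match.group(1) if match else floor_user
    return record["user"], to_user, floor_user

table = {r["id"]: from_to_floor(r) for r in records}
```

The key design point is that To and Floor can differ: $C_C$ and $C_D$ sit on $U_A$'s floor but are addressed to other users.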

- From: `F_Idx`, `F_CommID`, `F_Time`, `F_UserID`, `F_UserName`, `F_Comment`
- To: `T_Idx`, `T_UserID`, `T_UserName`, `T_CommID`, `T_Time`, `T_Comment`
- Floor: `Floor_Idx`, `Floor_CommID`, `Floor_UserID`

In total, we design the 15 columns above. The suffixes of the column names are explained below:
- Idx: the DataFrame index in the dataset `data` after the Section 1 processing
- CommID: the ID of the text
- Time: the posting time of the text
- UserID: the user ID
- UserName: the user name
- Comment: the content of the text

In [83]:
# Copy
inter_data=comment_data.copy()
inter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype                    
---  ------           --------------  -----                    
 0   CommentID        3850 non-null   int64                    
 1   CommentTime      3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID           3850 non-null   int64                    
 3   CommentRaw       3850 non-null   object                   
 4   Comment          3829 non-null   object                   
 5   CommentLike      3850 non-null   int64                    
 6   CommentReply     2664 non-null   Int64                    
 7   UserID           3850 non-null   int64                    
 8   UserName         3850 non-null   object                   
 9   Province         3850 non-null   object                   
 10  ProvinceCode     3850 non-null   int32                    
 11  Region           1694 non-null   object                 

Remove the unnecessary columns (e.g., user features).

In [84]:
comment_data

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,Comment,CommentLike,CommentReply,UserID,UserName,Province,...,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,4915861950828535,2023-06-23 18:31:24+08:00,0,考古,考古,0,0,7752387333,祎只狸猫,其他,...,,其他,橘圈小透明阮淼祎,0,45,149,260,0,1,0
1,4910942133158267,2023-06-10 04:41:48+08:00,0,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,0,0,7604058796,邮一棵草莓i,其他,...,,其他,垃圾滚！！！,0,2,48,124,0,0,0
2,4862811883439433,2023-01-28 09:09:22+08:00,0,真不知道迅哥给评论区投了多少米？[允悲][doge],真不知道迅哥给评论区投了多少米？,0,0,7577283965,kmimg7,甘肃,...,庆阳,甘肃 庆阳,,1,0,31,4,0,0,0
3,4862705678417948,2023-01-28 02:07:20+08:00,0,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗,0,0,7476902376,小祀弟弟吖,其他,...,,其他,,0,0,39,2,0,0,0
4,4858911816944130,2023-01-17 14:51:54+08:00,0,[吃瓜]现在是2023年 回来考古的点个赞,现在是2023年 回来考古的点个赞,0,0,7724649649,不知道如何评价,其他,...,,其他,,1,0,55,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3845,4380879537777020,2019-06-08 12:02:17+08:00,0,好！,好！,0,0,3086524793,逍遥鸟酱,其他,...,,其他,Valder Fields,1,210,483,3371,0,0,0
3846,4380879528787121,2019-06-08 12:02:15+08:00,0,快测试等不及啦！@Re莫德雷德厨_,快测试等不及啦！@Re莫德雷德厨_,2,0,6058348808,德不不不奶,其他,...,,其他,哪里是喜欢病娇，其实就是喜欢被一个人坚定选择的感觉罢了…,1,145,627,2943,0,0,0
3847,4380879427903629,2019-06-08 12:01:51+08:00,0,给我也整一个@璟也想成为物理学家,给我也整一个@璟也想成为物理学家,0,0,2836607150,天使爱吃麻婆豆腐,陕西,...,西安,陕西 西安,,1,35,768,23,0,0,0
3848,4380879289628087,2019-06-08 12:01:18+08:00,0,Pv出来了！,Pv出来了！,0,0,7161385652,Neilwasabi,上海,...,,上海,,1,0,124,0,0,0,0


In [85]:
user_drop=list(inter_data.columns[9:])
inter_data=inter_data.drop(user_drop,axis=1)
inter_data=inter_data.drop(['Comment'],axis=1)
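The positional slice `columns[9:]` only works as long as the column order never changes. As a hedged alternative, the sketch below selects the columns to keep by name. The toy frame and the names `toy`, `keep_cols` and `toy_inter` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for `comment_data` (values are invented).
toy = pd.DataFrame({
    "CommentID": [1, 2],
    "CommentRaw": ["a[doge]", "b"],
    "Comment": ["a", "b"],
    "UserID": [10, 20],
    "UserName": ["u1", "u2"],
    "Province": ["x", "y"],
    "UserFan": [3, 4],
})

# Keep the interaction columns by name instead of dropping by position.
keep_cols = ["CommentID", "CommentRaw", "UserID", "UserName"]
toy_inter = toy.loc[:, keep_cols].copy()
print(list(toy_inter.columns))
```

Selecting by name fails loudly with a `KeyError` if a column is missing, whereas a positional slice silently keeps or drops the wrong columns.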

In [86]:
inter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   CommentID     3850 non-null   int64                    
 1   CommentTime   3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID        3850 non-null   int64                    
 3   CommentRaw    3850 non-null   object                   
 4   CommentLike   3850 non-null   int64                    
 5   CommentReply  2664 non-null   Int64                    
 6   UserID        3850 non-null   int64                    
 7   UserName      3850 non-null   object                   
dtypes: Int64(1), datetime64[ns, UTC+08:00](1), int64(4), object(2)
memory usage: 244.5+ KB


Create a new, empty DataFrame `df` to hold the re-organized table.

In [87]:
column_names=['F_Idx','F_CommID','F_Time','F_UserID','F_UserName','F_Comment','T_Idx','T_UserID','T_UserName','T_CommID','T_Time','T_Comment','Floor_Idx','Floor_CommID','Floor_UserID']
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID


In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   F_Idx         0 non-null      object
 1   F_CommID      0 non-null      object
 2   F_Time        0 non-null      object
 3   F_UserID      0 non-null      object
 4   F_UserName    0 non-null      object
 5   F_Comment     0 non-null      object
 6   T_Idx         0 non-null      object
 7   T_UserID      0 non-null      object
 8   T_UserName    0 non-null      object
 9   T_CommID      0 non-null      object
 10  T_Time        0 non-null      object
 11  T_Comment     0 non-null      object
 12  Floor_Idx     0 non-null      object
 13  Floor_CommID  0 non-null      object
 14  Floor_UserID  0 non-null      object
dtypes: object(15)
memory usage: 124.0+ bytes


### 3.1.1 User type 1 (Plain): no user interaction; the user only left a comment

Identifier: both `RootID` and `CommentReply` are 0.
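A minimal sketch of how the two columns split the comments into groups, using a made-up three-row frame (`toy` and the mask names are hypothetical). It assumes, as in `inter_data`, that `CommentReply` is a nullable `Int64` column that is missing for replies.

```python
import pandas as pd

toy = pd.DataFrame({
    "RootID": [0, 0, 111],
    "CommentReply": pd.array([0, 2, pd.NA], dtype="Int64"),
})

# Plain: top-level comment that nobody replied to.
plain_mask = (toy["RootID"] == 0) & (toy["CommentReply"] == 0)
# Floor: top-level comment with at least one reply.
floor_mask = (toy["RootID"] == 0) & (toy["CommentReply"] != 0)
# Replies have a non-zero RootID, so they fall outside both masks
# even though their CommentReply is missing.
reply_mask = toy["RootID"] != 0

print(toy[plain_mask].index.tolist())
print(toy[floor_mask].index.tolist())
print(toy[reply_mask].index.tolist())
```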

In [89]:
plain_comm=inter_data[(inter_data['RootID']==0) & (inter_data['CommentReply']==0)]
plain_comm

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,CommentLike,CommentReply,UserID,UserName
0,4915861950828535,2023-06-23 18:31:24+08:00,0,考古,0,0,7752387333,祎只狸猫
1,4910942133158267,2023-06-10 04:41:48+08:00,0,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,0,0,7604058796,邮一棵草莓i
2,4862811883439433,2023-01-28 09:09:22+08:00,0,真不知道迅哥给评论区投了多少米？[允悲][doge],0,0,7577283965,kmimg7
3,4862705678417948,2023-01-28 02:07:20+08:00,0,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],0,0,7476902376,小祀弟弟吖
4,4858911816944130,2023-01-17 14:51:54+08:00,0,[吃瓜]现在是2023年 回来考古的点个赞,0,0,7724649649,不知道如何评价
...,...,...,...,...,...,...,...,...
3845,4380879537777020,2019-06-08 12:02:17+08:00,0,好！,0,0,3086524793,逍遥鸟酱
3846,4380879528787121,2019-06-08 12:02:15+08:00,0,快测试等不及啦！@Re莫德雷德厨_,2,0,6058348808,德不不不奶
3847,4380879427903629,2019-06-08 12:01:51+08:00,0,给我也整一个@璟也想成为物理学家,0,0,2836607150,天使爱吃麻婆豆腐
3848,4380879289628087,2019-06-08 12:01:18+08:00,0,Pv出来了！,0,0,7161385652,Neilwasabi


In [90]:
# Extract the information of a Plain user
'''
Note: this function works on a single row/record. To run it on the whole DataFrame, use apply (as in the call below).
'''
def extract_plain(row, m_idx=None, m_UId=None, m_CId=None):
    # From-side fields: index label, CommentID, CommentTime, UserID, UserName, CommentRaw
    plain_values = [row.name, row.iloc[0], row.iloc[1], row.iloc[6], row.iloc[7], row.iloc[3]]
    # Compare with None explicitly so that a floor at index 0 is not mistaken for "no floor"
    if m_idx is not None and m_UId is not None and m_CId is not None:
        # Reply inside a floor: the To-side fields stay empty for now, the Floor fields are filled
        df.loc[len(df)] = plain_values + [np.nan] * 6 + [m_idx, m_CId, m_UId]
    else:
        # Plain comment: every To-side and Floor field stays empty
        df.loc[len(df)] = plain_values + [np.nan] * (len(column_names) - len(plain_values))

plain_comm.apply(extract_plain, axis=1)
df

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
0,0,4915861950828535,2023-06-23 18:31:24+08:00,7752387333,祎只狸猫,考古,,,,,,,,,
1,1,4910942133158267,2023-06-10 04:41:48+08:00,7604058796,邮一棵草莓i,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,,,,,,,,,
2,2,4862811883439433,2023-01-28 09:09:22+08:00,7577283965,kmimg7,真不知道迅哥给评论区投了多少米？[允悲][doge],,,,,,,,,
3,3,4862705678417948,2023-01-28 02:07:20+08:00,7476902376,小祀弟弟吖,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],,,,,,,,,
4,4,4858911816944130,2023-01-17 14:51:54+08:00,7724649649,不知道如何评价,[吃瓜]现在是2023年 回来考古的点个赞,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2456,3845,4380879537777020,2019-06-08 12:02:17+08:00,3086524793,逍遥鸟酱,好！,,,,,,,,,
2457,3846,4380879528787121,2019-06-08 12:02:15+08:00,6058348808,德不不不奶,快测试等不及啦！@Re莫德雷德厨_,,,,,,,,,
2458,3847,4380879427903629,2019-06-08 12:01:51+08:00,2836607150,天使爱吃麻婆豆腐,给我也整一个@璟也想成为物理学家,,,,,,,,,
2459,3848,4380879289628087,2019-06-08 12:01:18+08:00,7161385652,Neilwasabi,Pv出来了！,,,,,,,,,
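Appending one row at a time with `df.loc[len(df)] = ...` is easy to follow but slow on larger tables. The sketch below is a possible vectorized alternative for the Plain case, not the notebook's method: it renames the columns, turns the original index into `F_Idx`, and lets `reindex` create the empty To-side and Floor columns. The toy data and the names `toy_plain`, `rename_map` and `plain_rows` are assumptions for illustration.

```python
import pandas as pd

column_names = ['F_Idx', 'F_CommID', 'F_Time', 'F_UserID', 'F_UserName', 'F_Comment',
                'T_Idx', 'T_UserID', 'T_UserName', 'T_CommID', 'T_Time', 'T_Comment',
                'Floor_Idx', 'Floor_CommID', 'Floor_UserID']

# Toy stand-in for `plain_comm`; the index plays the role of the original row label.
toy_plain = pd.DataFrame({
    "CommentID": [101, 102],
    "CommentTime": pd.to_datetime(["2023-01-01 10:00", "2023-01-02 11:00"]),
    "CommentRaw": ["first", "second"],
    "UserID": [1, 2],
    "UserName": ["u1", "u2"],
}, index=[0, 4])

rename_map = {"CommentID": "F_CommID", "CommentTime": "F_Time", "UserID": "F_UserID",
              "UserName": "F_UserName", "CommentRaw": "F_Comment"}

plain_rows = (
    toy_plain.rename(columns=rename_map)
    .rename_axis("F_Idx")            # name the index so reset_index yields an F_Idx column
    .reset_index()
    .reindex(columns=column_names)   # missing To-side/Floor columns are created as NaN
)
print(plain_rows[["F_Idx", "F_CommID", "T_Idx"]])
```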


### 3.1.2 User type 2 (Floor): all users involved in the "floor structure"

The marker of this user type, "回复@UserName:" ("Reply@UserName:"), is formatted inconsistently in the dataset, and usernames may also contain special characters.<br>
We therefore use the regular expression `reply_pattern` to identify this pattern.
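To make the variations concrete, here is a small standard-library sketch that applies the same pattern to shortened copies of three comments from this section plus one non-reply. The helper `reply_target` and the list `samples` are hypothetical names introduced only for this illustration.

```python
import re

reply_pattern = r'回复\s?@([^\s:：]+)'

samples = [
    '回复@醉筠子-不放弃本子:你说你',     # ASCII colon, hyphen inside the name
    '回复 @网恋被骗700万:鹅厂抄就恶心',  # a space between "回复" and "@"
    '回复@夜降萃梦乡_：原神不是耻辱',     # full-width colon, trailing underscore
    '考古',                             # not a reply at all
]

def reply_target(text):
    """Return the username after '回复@', or None if the text is not a reply."""
    m = re.search(reply_pattern, text)
    return m.group(1) if m else None

targets = [reply_target(s) for s in samples]
print(targets)
```

The capture group stops at the first whitespace or colon (ASCII or full-width), which is why hyphens, digits and underscores inside usernames are kept.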

In [91]:
# Example: a username that contains special characters
inter_data.loc[3048]

CommentID                                        4385914031449975
CommentTime                             2019-06-22 09:27:34+08:00
RootID                                           4380899494009653
CommentRaw      回复@醉筠子-不放弃本子:你说你🐎呢，看见个开放世界，看见滑翔翼就，看见卡通渲染就塞尔达。塞...
CommentLike                                                     0
CommentReply                                                 <NA>
UserID                                                 5356469720
UserName                                                error1980
Name: 3048, dtype: object

In [92]:
reply_pattern = r'回复\s?@([^\s:：]+)'
'''Test'''
inter_data[inter_data['CommentRaw'].str.contains(reply_pattern)].loc[3755]

CommentID                        4380987938821807
CommentTime             2019-06-08 19:13:02+08:00
RootID                           4380883840572546
CommentRaw      回复 @网恋被骗700万:鹅厂抄就恶心，米哈游抄就是正事嗷[嘻嘻]
CommentLike                                    14
CommentReply                                 <NA>
UserID                                 5596441213
UserName                                    法海爱一休
Name: 3755, dtype: object

In [93]:
'''Test'''
inter_data[inter_data['CommentRaw'].str.contains(reply_pattern)].loc[557]

CommentID                                        4389266353859602
CommentTime                             2019-07-01 15:28:30+08:00
RootID                                                          0
CommentRaw      回复@夜降萃梦乡_：原神不是耻辱，你这种带节奏的才是耻辱，你倒是给我粗制滥造一个啊，某大厂新...
CommentLike                                                     0
CommentReply                                                    1
UserID                                                 3281534190
UserName                                                    开辟之星3
Name: 557, dtype: object

In [94]:
inter_data[inter_data['CommentRaw'].str.contains(reply_pattern)]

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,CommentLike,CommentReply,UserID,UserName
57,4661616077177105,2021-07-22 04:29:03+08:00,4460703643517009,回复@蠢萌萌66:毕竟当初被带节奏的总不能打自己的脸，只能选择性忽略当个无脑黑了[吃瓜],0,,7375957830,黑鸭嗨哟
78,4472599378056641,2020-02-16 14:24:11+08:00,4452751050341087,回复@孙笑川5655:哈哈哈哈哈哈哈，装逼谁不会呀，扯个外星人就是高端数码粉嗷，牛逼牛逼[哈哈],0,,7087027730,云夜悠长
80,4460825508899118,2020-01-15 02:39:01+08:00,4452751050341087,回复@有毒的茶茶:爷要换外星人Alienware了，你们就继续吹吧，只有手机的白嫖党们,0,,7217990194,孙笑川5655
83,4459449890115897,2020-01-11 07:32:48+08:00,4452751050341087,回复@慧骃要成为本子画师:他是抄袭塞尔达的，你看不出来吗。我有整个塞尔达系列的所有卡带。你这...,0,,7217990194,孙笑川5655
103,4460676678814525,2020-01-14 16:47:37+08:00,4441023834998487,回复@我伊布贼溜:老任把原神放进ns了，一定是米给任天堂塞钱了，作为任豚真的有被冒犯到[怒],0,,6192175029,Zh40q14NcheN
...,...,...,...,...,...,...,...,...
3835,4380992594400846,2019-06-08 19:31:32+08:00,4380879889748007,回复@沙奈朵的裙底到底有什么:是谁逗谁笑也请你整清楚，我也明确表态了不想吵，阴阳怪气的回复大...,4,,6201564748,兔纸今天能摸到鱼吗
3836,4380988769262132,2019-06-08 19:16:20+08:00,4380879889748007,回复@怕事先改名shaw:[doge]求求你别玩任天堂，你也不想想是谁不配,10,,2950862475,男人的浪漫是剑风传奇
3837,4380988346320255,2019-06-08 19:14:39+08:00,4380879889748007,回复@怕事先改名shaw:你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就...,26,,5646067604,Aki是乌龟
3838,4380987008064553,2019-06-08 19:09:20+08:00,4380879889748007,回复@沉迷艾欧泽亚的菜菜啊:🐮🍺，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这...,3,,6201564748,兔纸今天能摸到鱼吗


The same regular expression can also be used to extract the username from "回复@UserName:" ("Reply@UserName:").

In [95]:
inter_data['CommentRaw'].str.extractall(reply_pattern).groupby(level=0).apply(lambda x: ','.join(x[0]))[3755]

'网恋被骗700万'

In [96]:
inter_data['CommentRaw'].str.extractall(reply_pattern).groupby(level=0).apply(lambda x: ','.join(x[0]))[3048]

'醉筠子-不放弃本子'

In [97]:
inter_data['CommentRaw'].str.extractall(reply_pattern).groupby(level=0).apply(lambda x: ','.join(x[0]))[557]

'夜降萃梦乡_'
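The cells above combine `str.extractall` with `groupby` so that several matches in one comment are joined with commas. If only the first "回复@" target per comment is needed, `Series.str.extract` is a shorter option. The sketch below uses a made-up three-row Series (`raw`, `first_target`) whose index labels mimic the rows tested above.

```python
import pandas as pd

reply_pattern = r'回复\s?@([^\s:：]+)'

raw = pd.Series(
    ['回复 @网恋被骗700万:鹅厂', '好！', '回复@夜降萃梦乡_：原神'],
    index=[3755, 3845, 557],
)

# expand=False returns a Series; rows without a match become NaN.
first_target = raw.str.extract(reply_pattern, expand=False)
print(first_target)
```

Unlike the `extractall` version, the result keeps every original index label, so rows without a reply marker remain visible as missing values.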

The regular expression works as intended, so we can now process the whole dataset.

In [98]:
floor_comm = inter_data[(inter_data['RootID']==0) & (inter_data['CommentReply']!=0)]
floor_comm

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,CommentLike,CommentReply,UserID,UserName
9,4832608544361987,2022-11-06 00:52:05+08:00,0,两年了，回过头来看[打call][打call][打call][打call],9,1,7776168152,养着兔子的猫咪
18,4747356392395706,2022-03-15 18:50:28+08:00,0,前面的评论跟现在完全不一样，全都打脸场面,17,3,7415480335,书一点禾
21,4745116860550884,2022-03-09 14:31:22+08:00,0,阿伟你又在翻看古战场诶 休息一下吧[doge],5,3,5952984271,罗小黑本喵_
28,4729105448441116,2022-01-24 10:07:44+08:00,0,这么早就有神里绫华了？？,0,2,6313769853,希儿呐xy
36,4652736522752139,2021-06-27 16:24:53+08:00,0,为什么蒙德的序章会有神里[二哈],15,1,6005792475,绯羽绯葬
...,...,...,...,...,...,...,...,...
3506,4380885732287240,2019-06-08 12:26:53+08:00,0,这……有些地方一摸一样吧,1767,207,6500909470,oO土豆泥O
3676,4380884881100488,2019-06-08 12:23:30+08:00,0,@AI娘爱酱 爱酱我想去原神看看,0,1,5667435118,Mmmm傻呀
3685,4380883840572546,2019-06-08 12:19:23+08:00,0,塞尔达即视感,2450,123,1957259477,鸥洗恩
3796,4380881706079221,2019-06-08 12:10:54+08:00,0,龙…龙之谷？？[允悲],3,2,2856517332,hakumei_surfing_ver


In [99]:
reply_pattern = r'回复\s?@([^\s:：]+)'
# Extract a direct reply together with the floor comment it replies to
def extract_comm(row1, m_idx, m_UId, m_UName, m_CId, m_time, m_Comm):
    comm_values = [row1.name, row1.iloc[0], row1.iloc[1], row1.iloc[6], row1.iloc[7], row1.iloc[3], m_idx, m_UId, m_UName, m_CId, m_time, m_Comm, m_idx, m_CId, m_UId]
    df.loc[len(df)] = comm_values

In [100]:
# Process floor_comm
def extract_po(row0):
    # Extract the information of the floor (top-level) comment
    m_idx = row0.name
    m_CId = row0.iloc[0]
    m_time = row0.iloc[1]
    m_Comm = row0.iloc[3]
    m_UId = row0.iloc[6]
    m_UName = row0.iloc[7]

    # sub1: the floor comment plus every comment whose RootID points to it.
    # The first row is expected to be the floor comment itself.
    sub1 = inter_data[(inter_data['CommentID']==m_CId) | (inter_data['RootID']==m_CId)]
    extract_plain(sub1.iloc[0])

    # sub2: direct replies to the floor comment (no "回复@UserName" prefix)
    sub2 = sub1[(~sub1['CommentRaw'].str.contains(reply_pattern, na=False)) & (sub1['RootID']!=0)]
    if not sub2.empty:
        sub2.apply(extract_comm, axis=1,args=(m_idx, m_UId, m_UName, m_CId, m_time, m_Comm))

    # sub3: replies to another user inside the floor ("回复@UserName" prefix)
    sub3 = sub1[(sub1['CommentRaw'].str.contains(reply_pattern, na=False)) & (sub1['RootID']!=0)]
    if not sub3.empty:
        sub3.apply(extract_plain, axis=1, args=(m_idx, m_UId, m_CId))
        rpl_user = sub3['CommentRaw'].str.extractall(reply_pattern).groupby(level=0).apply(lambda x: ','.join(x[0]))
        rpl_userIdx = list(rpl_user.index)
        # Rows of df that came from sub3, matched through their original index F_Idx
        rpl_Idx = list(df[df['F_Idx'].isin(rpl_userIdx)].index)
        df.loc[rpl_Idx, 'T_UserName'] = list(rpl_user)

In [101]:
floor_comm.apply(extract_po, axis=1)

9       None
18      None
21      None
28      None
36      None
        ... 
3506    None
3676    None
3685    None
3796    None
3821    None
Length: 203, dtype: object
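The row-by-row approach above links replies to their floor through `floor_comm`, which depends on `CommentReply` being correct. As a hedged alternative (not the notebook's method), replies can be linked to their floor with a self-merge on `RootID == CommentID`, which does not look at `CommentReply` at all. The toy frame and the names `replies`, `floors` and `linked` are invented for this sketch.

```python
import pandas as pd

# Toy stand-in for `inter_data`: two floors (215, 217), each with one reply.
toy = pd.DataFrame({
    "CommentID": [500, 501, 600, 601],
    "RootID": [0, 500, 0, 600],
    "CommentRaw": ["floor A", "reply to A", "floor B", "回复@x:hi"],
    "UserID": [1, 2, 3, 4],
}, index=[215, 216, 217, 218])

# Replies keep their original row label as F_Idx.
replies = toy[toy["RootID"] != 0].rename_axis("F_Idx").reset_index()

# Floors expose only the three Floor_* fields used in `df`.
floors = (
    toy[toy["RootID"] == 0]
    .rename_axis("Floor_Idx")
    .reset_index()
    .rename(columns={"CommentID": "Floor_CommID", "UserID": "Floor_UserID"})
    .loc[:, ["Floor_Idx", "Floor_CommID", "Floor_UserID"]]
)

# Attach the floor to each reply; a left join keeps replies whose floor is missing.
linked = replies.merge(floors, left_on="RootID", right_on="Floor_CommID", how="left")
print(linked[["F_Idx", "Floor_Idx", "Floor_CommID", "Floor_UserID"]])
```

Because the join key is the comment ID itself, a reply is still attached to its floor when the floor's reply counter is wrong.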

In [102]:
df

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
0,0,4915861950828535,2023-06-23 18:31:24+08:00,7752387333,祎只狸猫,考古,,,,,NaT,,,,
1,1,4910942133158267,2023-06-10 04:41:48+08:00,7604058796,邮一棵草莓i,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,,,,,NaT,,,,
2,2,4862811883439433,2023-01-28 09:09:22+08:00,7577283965,kmimg7,真不知道迅哥给评论区投了多少米？[允悲][doge],,,,,NaT,,,,
3,3,4862705678417948,2023-01-28 02:07:20+08:00,7476902376,小祀弟弟吖,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],,,,,NaT,,,,
4,4,4858911816944130,2023-01-17 14:51:54+08:00,7724649649,不知道如何评价,[吃瓜]现在是2023年 回来考古的点个赞,,,,,NaT,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3843,3835,4380992594400846,2019-06-08 19:31:32+08:00,6201564748,兔纸今天能摸到鱼吗,回复@沙奈朵的裙底到底有什么:是谁逗谁笑也请你整清楚，我也明确表态了不想吵，阴阳怪气的回复大...,,,沙奈朵的裙底到底有什么,,NaT,,3821.0,4.380880e+15,6.201565e+09
3844,3836,4380988769262132,2019-06-08 19:16:20+08:00,2950862475,男人的浪漫是剑风传奇,回复@怕事先改名shaw:[doge]求求你别玩任天堂，你也不想想是谁不配,,,怕事先改名shaw,,NaT,,3821.0,4.380880e+15,6.201565e+09
3845,3837,4380988346320255,2019-06-08 19:14:39+08:00,5646067604,Aki是乌龟,回复@怕事先改名shaw:你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就...,,,怕事先改名shaw,,NaT,,3821.0,4.380880e+15,6.201565e+09
3846,3838,4380987008064553,2019-06-08 19:09:20+08:00,6201564748,兔纸今天能摸到鱼吗,回复@沉迷艾欧泽亚的菜菜啊:🐮🍺，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这...,,,沉迷艾欧泽亚的菜菜啊,,NaT,,3821.0,4.380880e+15,6.201565e+09


Double-check that the 3 special usernames ("网恋被骗700万", "醉筠子-不放弃本子" and "夜降萃梦乡_") were processed correctly.

In [103]:
df[df['F_Idx']==3755]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
3812,3755,4380987938821807,2019-06-08 19:13:02+08:00,5596441213,法海爱一休,回复 @网恋被骗700万:鹅厂抄就恶心，米哈游抄就是正事嗷[嘻嘻],,,网恋被骗700万,,NaT,,3685.0,4380884000000000.0,1957259000.0


In [104]:
df[df['F_Idx']==3048]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
3274,3048,4385914031449975,2019-06-22 09:27:34+08:00,5356469720,error1980,回复@醉筠子-不放弃本子:你说你🐎呢，看见个开放世界，看见滑翔翼就，看见卡通渲染就塞尔达。塞...,,,醉筠子-不放弃本子,,NaT,,2992.0,4380899000000000.0,6578280000.0


The username "夜降萃梦乡_" was not extracted. This is because the original comment by "夜降萃梦乡_" is missing: row 557 has `RootID` 0, so it is processed as a floor comment and never reaches the reply-extraction step. We can fill it in manually.

In [105]:
df[df['F_Idx']==557]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
2705,557,4389266353859602,2019-07-01 15:28:30+08:00,3281534190,开辟之星3,回复@夜降萃梦乡_：原神不是耻辱，你这种带节奏的才是耻辱，你倒是给我粗制滥造一个啊，某大厂新...,,,,,NaT,,,,


In [106]:
inter_data.loc[557]

CommentID                                        4389266353859602
CommentTime                             2019-07-01 15:28:30+08:00
RootID                                                          0
CommentRaw      回复@夜降萃梦乡_：原神不是耻辱，你这种带节奏的才是耻辱，你倒是给我粗制滥造一个啊，某大厂新...
CommentLike                                                     0
CommentReply                                                    1
UserID                                                 3281534190
UserName                                                    开辟之星3
Name: 557, dtype: object

In [107]:
df.loc[2705, 'T_UserName'] = re.findall(reply_pattern, df.loc[2705,'F_Comment'])[0]
df.loc[2705]

F_Idx                                                         557
F_CommID                                         4389266353859602
F_Time                                  2019-07-01 15:28:30+08:00
F_UserID                                               3281534190
F_UserName                                                  开辟之星3
F_Comment       回复@夜降萃梦乡_：原神不是耻辱，你这种带节奏的才是耻辱，你倒是给我粗制滥造一个啊，某大厂新...
T_Idx                                                         NaN
T_UserID                                                      NaN
T_UserName                                                 夜降萃梦乡_
T_CommID                                                      NaN
T_Time                                                        NaT
T_Comment                                                     NaN
Floor_Idx                                                     NaN
Floor_CommID                                                  NaN
Floor_UserID                                                  NaN
Name: 2705
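The fix above targets the single row 2705. A more general sketch, under the assumption that any row with an empty `T_UserName` and a "回复@" marker in `F_Comment` should be filled the same way, is shown below. The frame `toy_df` and the names `extracted` and `needs_fill` are hypothetical.

```python
import pandas as pd

reply_pattern = r'回复\s?@([^\s:：]+)'

toy_df = pd.DataFrame({
    "F_Comment": ["回复@夜降萃梦乡_：原神不是耻辱", "考古", "回复@abc:hi"],
    "T_UserName": [None, None, "abc"],
})

# First reply target per comment; NaN where the comment is not a reply.
extracted = toy_df["F_Comment"].str.extract(reply_pattern, expand=False)

# Fill only rows that have a reply marker but no T_UserName yet.
needs_fill = toy_df["T_UserName"].isna() & extracted.notna()
toy_df.loc[needs_fill, "T_UserName"] = extracted[needs_fill]
print(toy_df)
```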

`df` should have 3850 entries, but it has only 3848 (see the output below). The 2 missing entries are investigated and handled in Section 3.1.3.

In [108]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3848 entries, 0 to 3847
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3848 non-null   int64                    
 1   F_CommID      3848 non-null   int64                    
 2   F_Time        3848 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3848 non-null   int64                    
 4   F_UserName    3848 non-null   object                   
 5   F_Comment     3848 non-null   object                   
 6   T_Idx         439 non-null    float64                  
 7   T_UserID      439 non-null    float64                  
 8   T_UserName    1185 non-null   object                   
 9   T_CommID      439 non-null    float64                  
 10  T_Time        439 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     439 non-null    object                   
 12  Floor_Idx     1184 non-null   float64  

### 3.1.3 The 2 missing entries

In [109]:
inter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   CommentID     3850 non-null   int64                    
 1   CommentTime   3850 non-null   datetime64[ns, UTC+08:00]
 2   RootID        3850 non-null   int64                    
 3   CommentRaw    3850 non-null   object                   
 4   CommentLike   3850 non-null   int64                    
 5   CommentReply  2664 non-null   Int64                    
 6   UserID        3850 non-null   int64                    
 7   UserName      3850 non-null   object                   
dtypes: Int64(1), datetime64[ns, UTC+08:00](1), int64(4), object(2)
memory usage: 244.5+ KB


Extract the DataFrame index of the original `inter_data` and compare it with the `F_Idx` column of `df` to find the 2 missing entries.

In [110]:
ori_commIdx = list(inter_data.index)
len(ori_commIdx)

3850

In [111]:
f_idx = list(df['F_Idx'])
len(f_idx)

3848

Result: the 2 missing entries are 216 and 218.

In [112]:
list(set(ori_commIdx) - set(f_idx))

[216, 218]
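The set difference above works, although the order of a list built from a Python `set` is not guaranteed. A small sketch of the same comparison with `Index.difference`, which returns the missing labels sorted, is given below. The toy values and the names `original_index`, `copied` and `missing` are assumptions for illustration.

```python
import pandas as pd

original_index = pd.RangeIndex(6)      # stands in for inter_data.index
copied = pd.Series([0, 1, 3, 5])       # stands in for df['F_Idx']

# Labels present in the original index but absent from the copied ones.
missing = original_index.difference(pd.Index(copied))
print(missing.tolist())
```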

#### (1) 216

Extract the information of entry 216 from `inter_data`.

In [113]:
inter_data.loc[216]

CommentID                                        4415493412865414
CommentTime                             2019-09-12 00:25:27+08:00
RootID                                           4404535403395100
CommentRaw      唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...
CommentLike                                                     0
CommentReply                                                 <NA>
UserID                                                 6530036372
UserName                                                       淐馮
Name: 216, dtype: object

In [114]:
df[df['F_CommID']==4404535403395100]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
125,215,4404535403395100,2019-08-12 18:42:15+08:00,6940897092,元首的胖次00658,第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...,,,,,NaT,,,,


The floor user "元首的胖次00658" did receive a reply from "淐馮", yet the `CommentReply` value of the floor comment is 0 rather than 1. Comment 215 was therefore classified as Plain, and its reply 216 was never picked up.<br>
The same happens with Index 217 and 218, which are handled in part (2) below.

In [115]:
inter_data[(inter_data['CommentID']==4404535403395100) | (inter_data['RootID']==4404535403395100)]

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,CommentLike,CommentReply,UserID,UserName
215,4404535403395100,2019-08-12 18:42:15+08:00,0,第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...,4,0.0,6940897092,元首的胖次00658
216,4415493412865414,2019-09-12 00:25:27+08:00,4404535403395100,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,0,,6530036372,淐馮


In [116]:
inter_data.loc[213:219]

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,CommentLike,CommentReply,UserID,UserName
213,4404685299073412,2019-08-13 04:37:53+08:00,0,节奏大师好多，我顶不住了，这个社会，唉，当年流浪地球我也去顶过，和喷子对线。微博真的烂了，捧...,8,0.0,6286443624,神道天穹
214,4404676025439528,2019-08-13 04:01:02+08:00,0,厂商做一款游戏出来是希望所有玩家来玩并且喜欢，不是逼着所有玩家来玩，你觉得不好你可以不玩不看...,6,1.0,7280131819,野炊爆炸
215,4404535403395100,2019-08-12 18:42:15+08:00,0,第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...,4,0.0,6940897092,元首的胖次00658
216,4415493412865414,2019-09-12 00:25:27+08:00,4404535403395100,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,0,,6530036372,淐馮
217,4404499705840280,2019-08-12 16:20:24+08:00,0,真是太黑暗了啊[笑而不语],3,0.0,6829123837,万里的追寻者41043
218,4407171623915479,2019-08-20 01:17:39+08:00,4404499705840280,国产游戏的废土时代就此降临[笑而不语],0,,7087027730,云夜悠长
219,4404486724338196,2019-08-12 15:28:49+08:00,0,哇看到之后吓得我直接卸载了崩2和崩3，到了能不能退钱啊，算了卖号吧[二哈][二哈],0,0.0,5724524816,灰燕l
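Instead of spotting such cases by eye, one could search for every top-level comment whose `CommentReply` is 0 although some other comment names it as its `RootID`. The sketch below shows the idea on a made-up frame; `toy`, `has_children` and `inconsistent` are hypothetical names, and `CommentReply` is assumed to be a nullable `Int64` column as in `inter_data`.

```python
import pandas as pd

toy = pd.DataFrame({
    "CommentID": [500, 501, 600, 700, 701],
    "RootID": [0, 500, 0, 0, 700],
    "CommentReply": pd.array([0, pd.NA, 0, 1, pd.NA], dtype="Int64"),
}, index=[215, 216, 219, 300, 301])

# A comment has children if its CommentID appears as the RootID of a reply.
has_children = toy["CommentID"].isin(toy.loc[toy["RootID"] != 0, "RootID"])

# Top-level comments that claim zero replies but do have children.
inconsistent = toy[(toy["RootID"] == 0) & (toy["CommentReply"] == 0) & has_children]
print(inconsistent.index.tolist())
```

In this toy example only label 215 is reported: 219 really has no replies, and 300 already carries a non-zero counter.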


Manually fix the entries related to 216
- `df`: append a row for 216 at the end of the current DataFrame
- `comment_data`: update the `CommentReply` value of 215 from 0 to 1

In [117]:
len(df)

3848

In [118]:
idx216 = list(inter_data.loc[216])
idx216

[4415493412865414,
 Timestamp('2019-09-12 00:25:27+0800', tz='UTC+08:00'),
 4404535403395100,
 '唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕竟我一般不玩微博[doge]。',
 0,
 <NA>,
 6530036372,
 '淐馮']

In [119]:
idx215 = list(inter_data.loc[215])
idx215

[4404535403395100,
 Timestamp('2019-08-12 18:42:15+0800', tz='UTC+08:00'),
 0,
 '第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代码抄，原创游戏是不可能，就连塞尔达传说，伟大的战神都不是原创，我也不知道一群跟风狗瞎掺和什么，神秘海域和古墓丽影都没想你们这样撕。游戏定论尚不知道，理性评论，盲目跟风黑名单即可。',
 4,
 0,
 6940897092,
 '元首的胖次00658']

In [120]:
df.loc[3848,:] = [216, idx216[0], idx216[1], idx216[6], idx216[7], idx216[3], 215, idx215[6], idx215[7], idx215[0], idx215[1], idx215[3], 215, idx215[0], idx215[6]]
df.loc[3848]

F_Idx                                                       216.0
F_CommID                                       4415493412865414.0
F_Time                                  2019-09-12 00:25:27+08:00
F_UserID                                             6530036372.0
F_UserName                                                     淐馮
F_Comment       唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...
T_Idx                                                       215.0
T_UserID                                             6940897092.0
T_UserName                                             元首的胖次00658
T_CommID                                       4404535403395100.0
T_Time                                  2019-08-12 18:42:15+08:00
T_Comment       第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...
Floor_Idx                                                   215.0
Floor_CommID                                   4404535403395100.0
Floor_UserID                                         6940897092.0
Name: 3848

In [121]:
comment_data.loc[215,'CommentReply'] = 1
comment_data.loc[215]

CommentID                                           4404535403395100
CommentTime                                2019-08-12 18:42:15+08:00
RootID                                                             0
CommentRaw         第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...
Comment            第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...
CommentLike                                                        4
CommentReply                                                       1
UserID                                                    6940897092
UserName                                                  元首的胖次00658
Province                                                          其他
ProvinceCode                                                       2
Region                                                          None
UserLocation                                                      其他
UserDescription                                                  NaN
UserGender                        

We can confirm that the user "元首的胖次00658" is present in `profile_data` (from Section 2.8).

In [122]:
profile_data[profile_data['UserID']==6940897092]

Unnamed: 0,UserID,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,ProvinceCode,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
2629,6940897092,"[215, 281, 300]",元首的胖次00658,3,1,2,4,2,0,0,0,1,0,99,0,0,0,0


#### (2) 218

The steps are the same as for 216.

In [123]:
inter_data.loc[218,]

CommentID                4407171623915479
CommentTime     2019-08-20 01:17:39+08:00
RootID                   4404499705840280
CommentRaw            国产游戏的废土时代就此降临[笑而不语]
CommentLike                             0
CommentReply                         <NA>
UserID                         7087027730
UserName                             云夜悠长
Name: 218, dtype: object

In [124]:
idx217 = list(inter_data.loc[217])
idx217

[4404499705840280,
 Timestamp('2019-08-12 16:20:24+0800', tz='UTC+08:00'),
 0,
 '真是太黑暗了啊[笑而不语]',
 3,
 0,
 6829123837,
 '万里的追寻者41043']

In [125]:
idx218 = list(inter_data.loc[218])
idx218

[4407171623915479,
 Timestamp('2019-08-20 01:17:39+0800', tz='UTC+08:00'),
 4404499705840280,
 '国产游戏的废土时代就此降临[笑而不语]',
 0,
 <NA>,
 7087027730,
 '云夜悠长']

In [126]:
df.loc[3849,:] = [218, idx218[0], idx218[1], idx218[6], idx218[7], idx218[3], 217, idx217[6], idx217[7], idx217[0], idx217[1], idx217[3], 217, idx217[0], idx217[6]]
df.loc[3849]

F_Idx                               218.0
F_CommID               4407171623915479.0
F_Time          2019-08-20 01:17:39+08:00
F_UserID                     7087027730.0
F_UserName                           云夜悠长
F_Comment             国产游戏的废土时代就此降临[笑而不语]
T_Idx                               217.0
T_UserID                     6829123837.0
T_UserName                    万里的追寻者41043
T_CommID               4404499705840280.0
T_Time          2019-08-12 16:20:24+08:00
T_Comment                   真是太黑暗了啊[笑而不语]
Floor_Idx                           217.0
Floor_CommID           4404499705840280.0
Floor_UserID                 6829123837.0
Name: 3849, dtype: object

In [127]:
comment_data.loc[217,'CommentReply']=1
comment_data.loc[217]

CommentID                   4404499705840280
CommentTime        2019-08-12 16:20:24+08:00
RootID                                     0
CommentRaw                     真是太黑暗了啊[笑而不语]
Comment                              真是太黑暗了啊
CommentLike                                3
CommentReply                               1
UserID                            6829123837
UserName                         万里的追寻者41043
Province                                  其他
ProvinceCode                               2
Region                                  None
UserLocation                              其他
UserDescription                          NaN
UserGender                                 1
UserFan                                    4
UserFollow                                28
UserWeibo                                 70
UserVerified                               0
LoyalFan                                   0
VipRank                                    0
Name: 217, dtype: object

### 3.1.4 Change the data types to improve readability

In [128]:
df.head()

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
0,0.0,4915862000000000.0,2023-06-23 18:31:24+08:00,7752387000.0,祎只狸猫,考古,,,,,NaT,,,,
1,1.0,4910942000000000.0,2023-06-10 04:41:48+08:00,7604059000.0,邮一棵草莓i,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,,,,,NaT,,,,
2,2.0,4862812000000000.0,2023-01-28 09:09:22+08:00,7577284000.0,kmimg7,真不知道迅哥给评论区投了多少米？[允悲][doge],,,,,NaT,,,,
3,3.0,4862706000000000.0,2023-01-28 02:07:20+08:00,7476902000.0,小祀弟弟吖,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],,,,,NaT,,,,
4,4.0,4858912000000000.0,2023-01-17 14:51:54+08:00,7724650000.0,不知道如何评价,[吃瓜]现在是2023年 回来考古的点个赞,,,,,NaT,,,,


In [129]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3850 entries, 0 to 3849
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   float64                  
 1   F_CommID      3850 non-null   float64                  
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   float64                  
 4   F_UserName    3850 non-null   object                   
 5   F_Comment     3850 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      441 non-null    float64                  
 8   T_UserName    1187 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   float64  

The "From" (`F_`) columns have no missing values, so convert their numeric columns to `int64`.

In [130]:
df[['F_Idx','F_CommID','F_UserID']] = df[['F_Idx','F_CommID','F_UserID']].astype('int64')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3850 entries, 0 to 3849
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   int64                    
 1   F_CommID      3850 non-null   int64                    
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   int64                    
 4   F_UserName    3850 non-null   object                   
 5   F_Comment     3850 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      441 non-null    float64                  
 8   T_UserName    1187 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   float64  

The "To" (`T_`) columns do contain missing values, so they stay as `float`; display them with 1 decimal place.

In [131]:
df[df['T_Time'].notna()]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
2462,10,4836105126150507,2022-11-15 16:26:14+08:00,7772408887,bo_白色大月亮,这不是三年吗？[doge][doge],9.0,7.776168e+09,养着兔子的猫咪,4.832609e+15,2022-11-06 00:52:05+08:00,两年了，回过头来看[打call][打call][打call][打call],9.0,4.832609e+15,7.776168e+09
2464,19,4806342436983668,2022-08-25 13:19:57+08:00,5643636178,夕神心音,就喜欢挖坟看看以前的有趣发言[doge],18.0,7.415480e+09,书一点禾,4.747356e+15,2022-03-15 18:50:28+08:00,前面的评论跟现在完全不一样，全都打脸场面,18.0,4.747356e+15,7.415480e+09
2466,22,4836107903042323,2022-11-15 16:37:16+08:00,7772408887,bo_白色大月亮,烦内[doge],21.0,5.952984e+09,罗小黑本喵_,4.745117e+15,2022-03-09 14:31:22+08:00,阿伟你又在翻看古战场诶 休息一下吧[doge],21.0,4.745117e+15,5.952984e+09
2467,23,4802732529032035,2022-08-15 14:15:28+08:00,7752842167,名字就叫旧林,苏删,21.0,5.952984e+09,罗小黑本喵_,4.745117e+15,2022-03-09 14:31:22+08:00,阿伟你又在翻看古战场诶 休息一下吧[doge],21.0,4.745117e+15,5.952984e+09
2468,24,4798073790794371,2022-08-02 17:43:18+08:00,7351240289,才太晚,[doge],21.0,5.952984e+09,罗小黑本喵_,4.745117e+15,2022-03-09 14:31:22+08:00,阿伟你又在翻看古战场诶 休息一下吧[doge],21.0,4.745117e+15,5.952984e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3833,3839,4380985154451322,2019-06-08 19:01:58+08:00,2950862475,男人的浪漫是剑风传奇,建议买一个NS，然后选择购买塞尔达，就可以提前玩到原神了，还不用担心流量，超棒呢[太开心],3821.0,6.201565e+09,兔纸今天能摸到鱼吗,4.380880e+15,2019-06-08 12:03:41+08:00,这画质我怕手机带不动啊[泪],3821.0,4.380880e+15,6.201565e+09
3834,3841,4380977293759679,2019-06-08 18:30:44+08:00,5646067604,Aki是乌龟,建议买一个NS，然后选择购买塞尔达，就可以提前玩到原神了，还不用担心流量，超棒呢[太开心],3821.0,6.201565e+09,兔纸今天能摸到鱼吗,4.380880e+15,2019-06-08 12:03:41+08:00,这画质我怕手机带不动啊[泪],3821.0,4.380880e+15,6.201565e+09
3835,3842,4380883555044975,2019-06-08 12:18:15+08:00,5165585956,看我弹死你这个猪皮,我的mini5已经饥渴难耐[馋嘴][馋嘴],3821.0,6.201565e+09,兔纸今天能摸到鱼吗,4.380880e+15,2019-06-08 12:03:41+08:00,这画质我怕手机带不动啊[泪],3821.0,4.380880e+15,6.201565e+09
3848,216,4415493412865414,2019-09-12 00:25:27+08:00,6530036372,淐馮,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,215.0,6.940897e+09,元首的胖次00658,4.404535e+15,2019-08-12 18:42:15+08:00,第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...,215.0,4.404535e+15,6.940897e+09


In [132]:
pd.set_option('display.float_format', '{:.1f}'.format)

In [133]:
df[df['T_Time'].notna()]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
2462,10,4836105126150507,2022-11-15 16:26:14+08:00,7772408887,bo_白色大月亮,这不是三年吗？[doge][doge],9.0,7776168152.0,养着兔子的猫咪,4832608544361987.0,2022-11-06 00:52:05+08:00,两年了，回过头来看[打call][打call][打call][打call],9.0,4832608544361987.0,7776168152.0
2464,19,4806342436983668,2022-08-25 13:19:57+08:00,5643636178,夕神心音,就喜欢挖坟看看以前的有趣发言[doge],18.0,7415480335.0,书一点禾,4747356392395706.0,2022-03-15 18:50:28+08:00,前面的评论跟现在完全不一样，全都打脸场面,18.0,4747356392395706.0,7415480335.0
2466,22,4836107903042323,2022-11-15 16:37:16+08:00,7772408887,bo_白色大月亮,烦内[doge],21.0,5952984271.0,罗小黑本喵_,4745116860550884.0,2022-03-09 14:31:22+08:00,阿伟你又在翻看古战场诶 休息一下吧[doge],21.0,4745116860550884.0,5952984271.0
2467,23,4802732529032035,2022-08-15 14:15:28+08:00,7752842167,名字就叫旧林,苏删,21.0,5952984271.0,罗小黑本喵_,4745116860550884.0,2022-03-09 14:31:22+08:00,阿伟你又在翻看古战场诶 休息一下吧[doge],21.0,4745116860550884.0,5952984271.0
2468,24,4798073790794371,2022-08-02 17:43:18+08:00,7351240289,才太晚,[doge],21.0,5952984271.0,罗小黑本喵_,4745116860550884.0,2022-03-09 14:31:22+08:00,阿伟你又在翻看古战场诶 休息一下吧[doge],21.0,4745116860550884.0,5952984271.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3833,3839,4380985154451322,2019-06-08 19:01:58+08:00,2950862475,男人的浪漫是剑风传奇,建议买一个NS，然后选择购买塞尔达，就可以提前玩到原神了，还不用担心流量，超棒呢[太开心],3821.0,6201564748.0,兔纸今天能摸到鱼吗,4380879889748007.0,2019-06-08 12:03:41+08:00,这画质我怕手机带不动啊[泪],3821.0,4380879889748007.0,6201564748.0
3834,3841,4380977293759679,2019-06-08 18:30:44+08:00,5646067604,Aki是乌龟,建议买一个NS，然后选择购买塞尔达，就可以提前玩到原神了，还不用担心流量，超棒呢[太开心],3821.0,6201564748.0,兔纸今天能摸到鱼吗,4380879889748007.0,2019-06-08 12:03:41+08:00,这画质我怕手机带不动啊[泪],3821.0,4380879889748007.0,6201564748.0
3835,3842,4380883555044975,2019-06-08 12:18:15+08:00,5165585956,看我弹死你这个猪皮,我的mini5已经饥渴难耐[馋嘴][馋嘴],3821.0,6201564748.0,兔纸今天能摸到鱼吗,4380879889748007.0,2019-06-08 12:03:41+08:00,这画质我怕手机带不动啊[泪],3821.0,4380879889748007.0,6201564748.0
3848,216,4415493412865414,2019-09-12 00:25:27+08:00,6530036372,淐馮,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,215.0,6940897092.0,元首的胖次00658,4404535403395100.0,2019-08-12 18:42:15+08:00,第一次上微博，实话说吧，微博基本上是垃圾场，粪坑，看不下去。原神是借鉴了，但根本不是照着源代...,215.0,4404535403395100.0,6940897092.0


## 3.2 检查与修正：T_UserID和T_UserName的空值数量不对等/Review and fix: incorrect number of null values for T_UserID and T_UserName

在Section 3.1.4中的`df.info()`可以看到，T_UserID的空值远多于T_UserName的空值（非空值分别为441和1187）。这是不合理的，应当想办法解决

The `df.info()` output in Section 3.1.4 shows that `T_UserID` has far more nulls than `T_UserName` (441 vs. 1187 non-null values). This is inconsistent, since every replied-to username should come with an ID, so we need to fix it

In [134]:
df[(df['T_UserName'].notna()) & (df['T_UserID'].isnull())]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
2481,57,4661616077177105,2021-07-22 04:29:03+08:00,7375957830,黑鸭嗨哟,回复@蠢萌萌66:毕竟当初被带节奏的总不能打自己的脸，只能选择性忽略当个无脑黑了[吃瓜],,,蠢萌萌66,,NaT,,56.0,4460703643517009.0,7351663044.0
2493,78,4472599378056641,2020-02-16 14:24:11+08:00,7087027730,云夜悠长,回复@孙笑川5655:哈哈哈哈哈哈哈，装逼谁不会呀，扯个外星人就是高端数码粉嗷，牛逼牛逼[哈哈],,,孙笑川5655,,NaT,,77.0,4452751050341087.0,7217990194.0
2494,80,4460825508899118,2020-01-15 02:39:01+08:00,7217990194,孙笑川5655,回复@有毒的茶茶:爷要换外星人Alienware了，你们就继续吹吧，只有手机的白嫖党们,,,有毒的茶茶,,NaT,,77.0,4452751050341087.0,7217990194.0
2495,83,4459449890115897,2020-01-11 07:32:48+08:00,7217990194,孙笑川5655,回复@慧骃要成为本子画师:他是抄袭塞尔达的，你看不出来吗。我有整个塞尔达系列的所有卡带。你这...,,,慧骃要成为本子画师,,NaT,,77.0,4452751050341087.0,7217990194.0
2507,103,4460676678814525,2020-01-14 16:47:37+08:00,6192175029,Zh40q14NcheN,回复@我伊布贼溜:老任把原神放进ns了，一定是米给任天堂塞钱了，作为任豚真的有被冒犯到[怒],,,我伊布贼溜,,NaT,,102.0,4441023834998487.0,6198559615.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3843,3835,4380992594400846,2019-06-08 19:31:32+08:00,6201564748,兔纸今天能摸到鱼吗,回复@沙奈朵的裙底到底有什么:是谁逗谁笑也请你整清楚，我也明确表态了不想吵，阴阳怪气的回复大...,,,沙奈朵的裙底到底有什么,,NaT,,3821.0,4380879889748007.0,6201564748.0
3844,3836,4380988769262132,2019-06-08 19:16:20+08:00,2950862475,男人的浪漫是剑风传奇,回复@怕事先改名shaw:[doge]求求你别玩任天堂，你也不想想是谁不配,,,怕事先改名shaw,,NaT,,3821.0,4380879889748007.0,6201564748.0
3845,3837,4380988346320255,2019-06-08 19:14:39+08:00,5646067604,Aki是乌龟,回复@怕事先改名shaw:你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就...,,,怕事先改名shaw,,NaT,,3821.0,4380879889748007.0,6201564748.0
3846,3838,4380987008064553,2019-06-08 19:09:20+08:00,6201564748,兔纸今天能摸到鱼吗,回复@沉迷艾欧泽亚的菜菜啊:🐮🍺，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这...,,,沉迷艾欧泽亚的菜菜啊,,NaT,,3821.0,4380879889748007.0,6201564748.0


一个可行的流程是
- 先借助user_data里的UserID和UserName与T_UserName进行匹配，尽可能的补完T_UserID
- 对于那些在user_data里从未出现过的、但是却出现在T_UserName里的用户名，推测是这些用户的原始评论因为某些原因在数据收集时间前就被删去。我们的model将会考虑到这种情况

A feasible workflow is
- Match `T_UserName` against the name-ID pairs (i.e., `UserName` and `UserID`) in `user_data` to fill as many null values in `T_UserID` as possible
- For usernames that appear in `T_UserName` but never in `user_data`, we assume that the original comments of these users were deleted for some reason before the data was collected. Our model will take this into account
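The first step above can be sketched as a lookup that fills only the gaps. This is a minimal illustration under stated assumptions, not the notebook's own code (which uses `pd.merge` below): the frames `users` and `replies` and all their values are made up, and `name_to_id` is a hypothetical helper name. Unlike overwriting the column with the merge result, this variant keeps any `T_UserID` that is already known.

```python
import pandas as pd

# Toy stand-ins for `user_data` and `df`; all values are invented.
users = pd.DataFrame({"UserID": [101, 102], "UserName": ["alice", "bob"]})
replies = pd.DataFrame({
    "T_UserName": ["alice", "carol", None, "bob"],
    "T_UserID": [None, 999.0, None, None],
})

# Build a name -> ID lookup; dropping duplicated names keeps it unambiguous.
name_to_id = users.drop_duplicates("UserName").set_index("UserName")["UserID"]

# Keep IDs that already exist and fill the remaining gaps from the lookup.
replies["T_UserID"] = replies["T_UserID"].fillna(
    replies["T_UserName"].map(name_to_id)
)

print(replies)
```

Rows whose name is absent from the lookup (or whose name is missing) simply stay null, which matches the second bullet above.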

In [135]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           2803 non-null   int64 
 1   FirstIndex       2803 non-null   int64 
 2   IndexList        2803 non-null   object
 3   UserName         2803 non-null   object
 4   TotalComment     2803 non-null   int64 
 5   Comment          2803 non-null   int64 
 6   Reply            2803 non-null   int64 
 7   LikeCount        2803 non-null   int64 
 8   Province         2803 non-null   object
 9   ProvinceCode     2803 non-null   int64 
 10  Region           1277 non-null   object
 11  UserLocation     2803 non-null   object
 12  Description      2803 non-null   int64 
 13  UserDescription  1735 non-null   object
 14  DescriClean      1654 non-null   object
 15  DescriptionLen   2803 non-null   int64 
 16  SpecialChar      2803 non-null   int64 
 17  UserGender       2803 non-null   

In [136]:
name_id = user_data.iloc[:,[0,3]]
name_id

Unnamed: 0,UserID,UserName
0,1001914040,薪火鹏
1,1008309912,提尔乌斯
2,1025900974,猫的摇篮-伪物
3,1028179843,非常神奇的老z
4,1035744261,假装很强的萌新
...,...,...
2798,7755717663,寂月海200007
2799,7766444420,烛虚cron
2800,7772408887,bo_白色大月亮
2801,7774567481,你好陈博


In [137]:
df = pd.merge(df, name_id, how='left', left_on='T_UserName', right_on='UserName')
df.head()

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,UserID,UserName
0,0,4915861950828535,2023-06-23 18:31:24+08:00,7752387333,祎只狸猫,考古,,,,,NaT,,,,,,
1,1,4910942133158267,2023-06-10 04:41:48+08:00,7604058796,邮一棵草莓i,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,,,,,NaT,,,,,,
2,2,4862811883439433,2023-01-28 09:09:22+08:00,7577283965,kmimg7,真不知道迅哥给评论区投了多少米？[允悲][doge],,,,,NaT,,,,,,
3,3,4862705678417948,2023-01-28 02:07:20+08:00,7476902376,小祀弟弟吖,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],,,,,NaT,,,,,,
4,4,4858911816944130,2023-01-17 14:51:54+08:00,7724649649,不知道如何评价,[吃瓜]现在是2023年 回来考古的点个赞,,,,,NaT,,,,,,


使数据更加可读

Make the dataset more readable

In [138]:
merge_uID = df['UserID']
merge_uID

0               NaN
1               NaN
2               NaN
3               NaN
4               NaN
           ...     
3845            NaN
3846            NaN
3847            NaN
3848   6940897092.0
3849   6829123837.0
Name: UserID, Length: 3850, dtype: float64

In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   int64                    
 1   F_CommID      3850 non-null   int64                    
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   int64                    
 4   F_UserName    3850 non-null   object                   
 5   F_Comment     3850 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      441 non-null    float64                  
 8   T_UserName    1187 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   floa

In [140]:
df = df.drop(columns=['T_UserID','UserID','UserName'])
df.insert(7,'T_UserID', merge_uID)

In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   int64                    
 1   F_CommID      3850 non-null   int64                    
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   int64                    
 4   F_UserName    3850 non-null   object                   
 5   F_Comment     3850 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      907 non-null    float64                  
 8   T_UserName    1187 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   floa

成功的将T_UserID的非空值从441增加到了907。当然，还是有280条记录（1187 − 907）的T_UserName没有对应的UserID。<br>
这些被回复的用户可能属于下面两种情况之一
- 他们的评论已经被删除
- 他们仅被他人@，未留下任何评论

我们的模型会考虑到这些用户

We successfully increased the number of non-null `T_UserID` values from 441 to 907. However, 280 rows (1187 − 907) still have a `T_UserName` without a matching UserID.<br>
The users named in these rows probably fall into one of the 2 categories below:
- Their comments were deleted
- They were only mentioned by someone else and never left a comment

Our model will take these users into consideration


In [142]:
df[(df['T_UserName'].notna()) & (df['T_UserID'].isnull())]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
2495,83,4459449890115897,2020-01-11 07:32:48+08:00,7217990194,孙笑川5655,回复@慧骃要成为本子画师:他是抄袭塞尔达的，你看不出来吗。我有整个塞尔达系列的所有卡带。你这...,,,慧骃要成为本子画师,,NaT,,77.0,4452751050341087.0,7217990194.0
2527,175,4410384465216570,2019-08-28 22:04:20+08:00,5491836726,奢求_82084,回复 @时间旅行机器:我也笑死了 同一时间 有一个同样模仿 同样抄袭的游戏 怎么就可着这个...,,,时间旅行机器,,NaT,,174.0,4408948503188967.0,5491836726.0
2534,187,4408497808137499,2019-08-23 17:07:26+08:00,2258141752,元气满满bou,回复@时间旅行机器: 是商业抄袭* 小小年纪眼睛就瞎了，心疼你,,,时间旅行机器,,NaT,,184.0,4407756130344107.0,2258141752.0
2535,188,4408048577861560,2019-08-22 11:22:21+08:00,2258141752,元气满满bou,回复@大头花阳: 你si去的爹告诉你的我不玩游戏？撒13玩意儿，睁眼瞎自欺欺人。就是有你这种...,,,大头花阳,,NaT,,184.0,4407756130344107.0,2258141752.0
2542,203,4408258518911734,2019-08-23 01:16:35+08:00,7087027730,云夜悠长,回复@今天原神凉了没:唉，不跟哈批对线了，也是没意思，反正对喷又没结果，最强法务部也不出手，...,,,今天原神凉了没,,NaT,,202.0,4405167526777662.0,6493310305.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3843,3835,4380992594400846,2019-06-08 19:31:32+08:00,6201564748,兔纸今天能摸到鱼吗,回复@沙奈朵的裙底到底有什么:是谁逗谁笑也请你整清楚，我也明确表态了不想吵，阴阳怪气的回复大...,,,沙奈朵的裙底到底有什么,,NaT,,3821.0,4380879889748007.0,6201564748.0
3844,3836,4380988769262132,2019-06-08 19:16:20+08:00,2950862475,男人的浪漫是剑风传奇,回复@怕事先改名shaw:[doge]求求你别玩任天堂，你也不想想是谁不配,,,怕事先改名shaw,,NaT,,3821.0,4380879889748007.0,6201564748.0
3845,3837,4380988346320255,2019-06-08 19:14:39+08:00,5646067604,Aki是乌龟,回复@怕事先改名shaw:你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就...,,,怕事先改名shaw,,NaT,,3821.0,4380879889748007.0,6201564748.0
3846,3838,4380987008064553,2019-06-08 19:09:20+08:00,6201564748,兔纸今天能摸到鱼吗,回复@沉迷艾欧泽亚的菜菜啊:🐮🍺，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这...,,,沉迷艾欧泽亚的菜菜啊,,NaT,,3821.0,4380879889748007.0,6201564748.0


## 3.3 后续处理1：移除文本中的“回复@username” ，并提取被@的用户名/Subsequent processing 1: remove "Reply@username", and extract the mentioned/@ username

In [143]:
h_graph = df.copy()
h_graph.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   int64                    
 1   F_CommID      3850 non-null   int64                    
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   int64                    
 4   F_UserName    3850 non-null   object                   
 5   F_Comment     3850 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      907 non-null    float64                  
 8   T_UserName    1187 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   floa

### 3.3.1 从文本中移除回复pattern（如：回复@username:）/Remove reply pattern (e.g., Reply@username:) from the text

In [144]:
def replace_text(text):
    reply_pattern = r'回复\s?@([^\s:：]+):|回复\s?@([^\s:：]+)：'
    return re.sub(reply_pattern, '', text)

h_graph['F_Comment'] = h_graph['F_Comment'].apply(replace_text)
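To make the behaviour of the reply pattern concrete, here is a small standalone sketch using only the standard `re` module. `REPLY_PATTERN` is a condensed form of the regular expression above (one alternative with a `[:：]` character class) that should match the same strings; that name, `strip_reply_prefix`, and the third sample string are assumptions made for illustration.

```python
import re

# Condensed variant of the notebook's pattern: "回复", an optional whitespace,
# "@username", then either an ASCII or a full-width colon.
REPLY_PATTERN = re.compile(r'回复\s?@[^\s:：]+[:：]')

def strip_reply_prefix(text: str) -> str:
    """Remove every "回复@username:" marker from a comment."""
    return REPLY_PATTERN.sub('', text)

samples = [
    '回复@蠢萌萌66:毕竟当初',          # ASCII colon
    '回复 @时间旅行机器:我也笑死了',    # whitespace before "@"
    '回复@某用户：全角冒号',            # full-width colon (made-up sample)
    '考古结束@Akoi',                    # plain mention, must stay untouched
]
cleaned = [strip_reply_prefix(s) for s in samples]
print(cleaned)
```

A plain `@username` mention without the leading "回复" is deliberately left in place, because it is extracted separately in Section 3.3.2.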

再次检查Section 3.1.2提及的3个特殊的username（“网恋被骗700万”，“醉筠子-不放弃本子”，“夜降萃梦乡_”）是否得到了正确的处理。

Double-check that the 3 special usernames mentioned in Section 3.1.2 ("网恋被骗700万", "醉筠子-不放弃本子" and "夜降萃梦乡_") were handled correctly.

In [145]:
h_graph[h_graph['F_Idx']==3755]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
3812,3755,4380987938821807,2019-06-08 19:13:02+08:00,5596441213,法海爱一休,鹅厂抄就恶心，米哈游抄就是正事嗷[嘻嘻],,6459204506.0,网恋被骗700万,,NaT,,3685.0,4380883840572545.5,1957259477.0


In [146]:
h_graph[h_graph['F_Idx']==3048]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
3274,3048,4385914031449975,2019-06-22 09:27:34+08:00,5356469720,error1980,你说你🐎呢，看见个开放世界，看见滑翔翼就，看见卡通渲染就塞尔达。塞尔达无非就是把以前都已经有...,,,醉筠子-不放弃本子,,NaT,,2992.0,4380899494009653.0,6578279612.0


In [147]:
h_graph[h_graph['F_Idx']==557]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
2705,557,4389266353859602,2019-07-01 15:28:30+08:00,3281534190,开辟之星3,原神不是耻辱，你这种带节奏的才是耻辱，你倒是给我粗制滥造一个啊，某大厂新出的《龙族幻想》和《...,,,夜降萃梦乡_,,NaT,,,,


没有遗漏的未处理的数据

No unprocessed records remain

In [148]:
reply_pattern = r'回复\s?@([^\s:：]+):|回复\s?@([^\s:：]+)：'
h_graph[h_graph['F_Comment'].str.contains(reply_pattern)]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID


### 3.3.2 提取被@的用户名/Extract the mentioned/@ username

这一类的文本特征为“...@username ...”，且有些用户@了多个人<br>
下面列出了一个关于@的简单例子：用户@了“原神”

Texts in this category follow the form "... @username ...", and some users mention several people<br>
A simple example of a mention is shown below. In this text, the user mentioned "原神" (written as "@原神")

In [149]:
h_graph.loc[30,'F_Comment']

'作为米哈游科技（上海）有限公司制作发行的一款开放世界冒险游戏，画面风格精致富有美感，是一款自由度很高的手游大作，我真心想获得本次测试，也会抓住每一个机会，我会努力争取的！[太开心]@原神 旅行者，安柏，凯亚，琴，丽莎，芭芭拉，到时候一定与你们结友！共度时光@微博抽奖平台 加油！#原神#'

提取出带有@的文本并保存为一个新的csv文件`ForCheck`，以方便后续的手动筛查异常情况<br>

Extract all records whose text contains @ and save them as a new csv file `ForCheck`, which makes it easier to check exceptions manually later

In [150]:
alpha_pattern = r'@([^\s@]+)'
check_manu = h_graph[h_graph['F_Comment'].str.contains(alpha_pattern)]
check_manu

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID
7,7,4842368296289558,2022-12-02 23:13:51+08:00,7752939212,花月歌浮舟,考古结束@Akoi,,,,,NaT,,,,
30,45,4481688375200938,2020-03-12 16:20:37+08:00,6153913596,原神忠粉,作为米哈游科技（上海）有限公司制作发行的一款开放世界冒险游戏，画面风格精致富有美感，是一款自...,,,,,NaT,,,,
32,47,4461401621863634,2020-01-16 16:48:17+08:00,5648392127,DodLIke刂兆,等私信@_恭弥大人_,,,,,NaT,,,,
33,48,4461395208773600,2020-01-16 16:22:48+08:00,5210165672,wangxorz,@b1gcatttttttttt 来个测试资格吧[二哈],,,,,NaT,,,,
35,53,4460756835567135,2020-01-14 22:06:08+08:00,6555427619,葱花不开花,@相生栗子 嘿嘿，栗子大大出来装个好友帮个忙[太开心][太开心],,,,,NaT,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3159,2772,4380924667828394,2019-06-08 15:01:37+08:00,5554932231,A大调的华尔兹,很期待新作@Rusi_,,,,,NaT,,,,
3200,2933,4380905709610665,2019-06-08 13:46:17+08:00,2688011343,yukin725,@Rioshady 塞尔达旷野之息…而且估计操作感差很多[允悲][允悲],,,,,NaT,,,,
3558,3419,4380893496035358,2019-06-08 12:57:45+08:00,1776100951,ReIKiTsuNeNeGi,@云玩家FlameBeam 手机玩家的钱真好赚？,,,,,NaT,,,,
3576,3545,4381283213536790,2019-06-09 14:46:20+08:00,6301774190,倾城恋怀,@任天堂香港有限公司,3506.0,6500909470.0,oO土豆泥O,4380885732287240.0,2019-06-08 12:26:53+08:00,这……有些地方一摸一样吧,3506.0,4380885732287240.0,6500909470.0


In [151]:
check_file='Datasets/ForCheck.csv' 
check_manu.to_csv(check_file, index=False,encoding='utf-8-sig')

使用alpha_pattern尽可能的提取出被@的用户名

Use `alpha_pattern` to extract as many mentioned (@) usernames as possible
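The following standalone sketch shows what `alpha_pattern` captures on a few strings, using `re.findall` from the standard library (pandas' `str.findall` applies the same rule per row). The dictionary `examples` and its keys are assumptions for illustration; the texts are shortened versions of comments shown in this notebook.

```python
import re

alpha_pattern = r'@([^\s@]+)'

# Shortened sample comments; the dictionary keys are hypothetical labels.
examples = {
    'single': '考古结束@Akoi',
    'multiple': '[太开心]@原神 旅行者，安柏@微博抽奖平台 加油！#原神#',
    'trailing_colon': '//@游研社: 米哈游新作《原神》放出了新预告',
}

# The capture group keeps everything after "@" up to the next whitespace or "@".
extracted = {key: re.findall(alpha_pattern, text) for key, text in examples.items()}
print(extracted)
```

The last case keeps the colon as part of the name, which is why a manual check of the extracted usernames is still needed.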

In [152]:
at_Idx = list(check_manu.index)
len(at_Idx)

1009

In [153]:
h_graph.loc[at_Idx, 'At_User'] = h_graph.loc[at_Idx, 'F_Comment'].str.findall(alpha_pattern)
h_graph[h_graph['At_User'].notna()]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,At_User
7,7,4842368296289558,2022-12-02 23:13:51+08:00,7752939212,花月歌浮舟,考古结束@Akoi,,,,,NaT,,,,,[Akoi]
30,45,4481688375200938,2020-03-12 16:20:37+08:00,6153913596,原神忠粉,作为米哈游科技（上海）有限公司制作发行的一款开放世界冒险游戏，画面风格精致富有美感，是一款自...,,,,,NaT,,,,,"[原神, 微博抽奖平台]"
32,47,4461401621863634,2020-01-16 16:48:17+08:00,5648392127,DodLIke刂兆,等私信@_恭弥大人_,,,,,NaT,,,,,[_恭弥大人_]
33,48,4461395208773600,2020-01-16 16:22:48+08:00,5210165672,wangxorz,@b1gcatttttttttt 来个测试资格吧[二哈],,,,,NaT,,,,,[b1gcatttttttttt]
35,53,4460756835567135,2020-01-14 22:06:08+08:00,6555427619,葱花不开花,@相生栗子 嘿嘿，栗子大大出来装个好友帮个忙[太开心][太开心],,,,,NaT,,,,,[相生栗子]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3159,2772,4380924667828394,2019-06-08 15:01:37+08:00,5554932231,A大调的华尔兹,很期待新作@Rusi_,,,,,NaT,,,,,[Rusi_]
3200,2933,4380905709610665,2019-06-08 13:46:17+08:00,2688011343,yukin725,@Rioshady 塞尔达旷野之息…而且估计操作感差很多[允悲][允悲],,,,,NaT,,,,,[Rioshady]
3558,3419,4380893496035358,2019-06-08 12:57:45+08:00,1776100951,ReIKiTsuNeNeGi,@云玩家FlameBeam 手机玩家的钱真好赚？,,,,,NaT,,,,,[云玩家FlameBeam]
3576,3545,4381283213536790,2019-06-09 14:46:20+08:00,6301774190,倾城恋怀,@任天堂香港有限公司,3506.0,6500909470.0,oO土豆泥O,4380885732287240.0,2019-06-08 12:26:53+08:00,这……有些地方一摸一样吧,3506.0,4380885732287240.0,6500909470.0,[任天堂香港有限公司]


经过手动检查后，F_Idx为下列值的数据在@的提取上存在问题。我们将在下面的部分逐一处理
- 1139 1239 860 1418 1465 1702 2109 2274 2779 3149 3779 2538 2653

After manual checking, the records with the following F_Idx values have problems in the extraction of mentioned usernames. We handle them one by one in the following parts ((1) to (4))

#### (1) F_Idx=1139
该用户的“@官方”中的@表达的是单词“艾特”，是用特殊符号来表达对应的中文含义，并非@了一个叫做“官方”的用户。在At_User中保留“原神”即可

The @ in this user's "@官方" stands for the Chinese word "艾特" (i.e., "to mention"): the symbol is used for its meaning, and the user is not mentioning an account called "官方". Just keep "原神" in At_User.

In [154]:
h_graph[h_graph['F_Idx']==1139]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,At_User
815,1139,4384309676961753,2019-06-17 23:12:25+08:00,2208067411,GanChongren,@原神 @官方，我真是小机灵鬼[笑而不语],,,,,NaT,,,,,"[原神, 官方，我真是小机灵鬼[笑而不语]]"


In [155]:
h_graph.loc[815,'At_User'] = ['{}'.format(h_graph.loc[815,'At_User'][0])]
h_graph[h_graph['F_Idx']==1139]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,At_User
815,1139,4384309676961753,2019-06-17 23:12:25+08:00,2208067411,GanChongren,@原神 @官方，我真是小机灵鬼[笑而不语],,,,,NaT,,,,,[原神]


#### (2) F_Idx  $\in [1239, 860, 1418, 1465, 1702, 2109, 2274, 2779, 3149, 3779, 2538, 2653] $

In [156]:
at_list = [1239,860,1418,1465,1702,2109,2274,2779,3149,3779,2538,2653]
at_list

[1239, 860, 1418, 1465, 1702, 2109, 2274, 2779, 3149, 3779, 2538, 2653]

In [157]:
at1_set = h_graph[h_graph['F_Comment'].str.contains(r'@([^:]+):')]
at1_set

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,At_User
562,860,4385957564129314,2019-06-22 12:20:33+08:00,3914447813,老老老老老那啊,你们是真不要脸了，哪怕改一改也行啊，复制粘贴不太好吧弟弟//@我再也不想写代码了 :你们真的...,,,,,NaT,,,,,[我再也不想写代码了]
912,1239,4383085795509573,2019-06-14 14:09:10+08:00,6449314474,十香Princess2018,//@糖醋SAO排骨:,,,,,NaT,,,,,[糖醋SAO排骨:]
929,1260,4383059581044564,2019-06-14 12:25:00+08:00,5264335436,忧鱼与熊掌,@襟上花r 琪亚娜 http://t.cn/AiNhbK5k,,,,,NaT,,,,,[襟上花r]
1073,1418,4382090839463454,2019-06-11 20:15:33+08:00,2482877025,昊哥昊哥耗,真实//@花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然...,,,,,NaT,,,,,[花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然而事实是...
1088,1465,4382003786816490,2019-06-11 14:29:39+08:00,2796207394,不抽烟CHIRS,#原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，...,,,,,NaT,,,,,"[原神, 微博抽奖平台, kevenhu]"
1316,1702,4381267762552372,2019-06-09 13:44:56+08:00,1740544170,不爱发博王左军,@不爱写字王右军 //@游研社: 米哈游新作《原神》放出了新预告,,,,,NaT,,,,,"[不爱写字王右军, 游研社:]"
1396,1786,4381190587010404,2019-06-09 08:38:16+08:00,1093773890,Luciferbear,@只为等待your PS4 PRO http://t.cn/AiC5X41g,,,,,NaT,,,,,[只为等待your]
1655,2109,4380996277150017,2019-06-08 19:46:10+08:00,3218789155,天雷牙皇,//@游研社: 米哈游新作《原神》放出了新预告,,,,,NaT,,,,,[游研社:]
2075,2779,4380924198016900,2019-06-08 14:59:45+08:00,1765337557,DDDDDlno,边抄边加了很多新的，光看pv还是很舒服的。希望能讲好自己的故事。//@游民星空: 米哈游新作...,,,,,NaT,,,,,[游民星空:]
2265,3149,4380897346334849,2019-06-08 13:13:03+08:00,5846473488,幸运的血玫瑰男爵,@Mr_陈家铧_924_106 @神座出流_ //@游侠网: 来了来了[并不简单][并不简单],,,,,NaT,,,,,"[Mr_陈家铧_924_106, 神座出流_, 游侠网:]"


除了F_Idx=2274的用户之外，at_list中所有的index都出现在了at1_set里。<br>
经过手动检查，可以发现2种情况
1. 用户名后面的冒号没有移除，如“@糖醋SAO排骨:”涉及的用户名被识别为“糖醋SAO排骨:”而非“糖醋SAO排骨”
2. 冒号后面的文本被错误的识别为用户名的一部分，如F_Idx=1418 (DataFrame Idx=1073)的记录<br>
使用下面的代码可以检查出第一种情况涉及的用户（合计9人）

All indices in at_list appear in at1_set, except F_Idx=2274<br>
After manual checking, 2 cases can be identified
1. The colon (i.e., `:`) after the username is not removed, e.g., the username extracted from "@糖醋SAO排骨:" should be "糖醋SAO排骨" rather than "糖醋SAO排骨:"
2. The text following the colon is incorrectly identified as part of the username. F_Idx=1418 (DataFrame Idx=1073) is an example of this case.<br>
The records in the first case (9 in total) can be found with the following code
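Both cases come from the fact that the extraction pattern only stops at whitespace or another `@`. As a hedged alternative (not the approach taken in this notebook, which post-processes the extracted lists instead), a stricter pattern could also stop at colons. `STRICT_AT_PATTERN` is a hypothetical name, and the sample comments are shortened from the table above.

```python
import re

# Hypothetical stricter pattern: a username also ends at an ASCII or
# full-width colon, not only at whitespace or the next "@".
STRICT_AT_PATTERN = re.compile(r'@([^\s@:：]+)')

comments = [
    '//@糖醋SAO排骨:',                                   # case 1: trailing colon
    '真实//@花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子',  # case 2: text after colon
    '@不爱写字王右军 //@游研社: 米哈游新作《原神》放出了新预告',
    '@原神 @官方，我真是小机灵鬼[笑而不语]',               # not fixed by this pattern
]
mentions = [STRICT_AT_PATTERN.findall(c) for c in comments]
print(mentions)
```

The last sample shows the limit of this idea: "@官方，..." has no colon or space after the name, so records like F_Idx=1139 would still need manual correction.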

In [158]:
def test_user(lst):
    return any(element.endswith(':') and element[:-1] != '' for element in lst)

for_test = h_graph[h_graph['At_User'].notna()]
len(for_test[for_test['At_User'].apply(test_user)])

9

In [159]:
for_test[for_test['At_User'].apply(test_user)]['At_User']

912                        [糖醋SAO排骨:]
1316                  [不爱写字王右军, 游研社:]
1655                           [游研社:]
2075                          [游民星空:]
2265    [Mr_陈家铧_924_106, 神座出流_, 游侠网:]
2415                   [半透明黑桶:, 寒天黑糖]
2985                      [不吃生肉也不重名:]
2998                     [error1980:]
3084                         [独孤剑萧萧:]
Name: At_User, dtype: object

使用自定义函数process_at将At_User中，username后面的冒号删去

Remove the colon after each username in At_User via the user-defined function `process_at`
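A variant of this colon-stripping step is sketched below, assuming one wants to treat the full-width colon "：" the same way as the ASCII ":". The function name `strip_after_colon` and the sample lists are illustrative assumptions; the notebook itself splits on the ASCII colon only and fixes the full-width case (F_Idx=2274) by hand.

```python
import re

def strip_after_colon(names):
    """Cut each extracted name at the first ASCII or full-width colon."""
    return [re.split(r'[:：]', name, maxsplit=1)[0] for name in names]

# Sample At_User lists taken from the outputs in this section.
print(strip_after_colon(['不爱写字王右军', '游研社:']))
print(strip_after_colon(['饼玉玉：没玩过游戏的']))
```

Names without any colon pass through unchanged, so the function can be applied to every At_User list without filtering first.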

In [160]:
def process_at(text):
    return [item.split(':')[0] for item in text]
at1_set['At_User'].apply(process_at)

562                      [我再也不想写代码了]
912                        [糖醋SAO排骨]
929                           [襟上花r]
1073                    [花盆栽柳树谁也拦不住]
1088           [原神, 微博抽奖平台, kevenhu]
1316                  [不爱写字王右军, 游研社]
1396                      [只为等待your]
1655                           [游研社]
2075                          [游民星空]
2265    [Mr_陈家铧_924_106, 神座出流_, 游侠网]
2415                   [半透明黑桶, 寒天黑糖]
2985                      [不吃生肉也不重名]
2998                     [error1980]
3084                         [独孤剑萧萧]
Name: At_User, dtype: object

In [161]:
at1_Idx = list(at1_set.index)
at1_Idx

[562,
 912,
 929,
 1073,
 1088,
 1316,
 1396,
 1655,
 2075,
 2265,
 2415,
 2985,
 2998,
 3084]

In [162]:
# 修改前/Before the revision
h_graph.loc[at1_Idx,'At_User']

562                                           [我再也不想写代码了]
912                                            [糖醋SAO排骨:]
929                                                [襟上花r]
1073    [花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然而事实是...
1088                                [原神, 微博抽奖平台, kevenhu]
1316                                      [不爱写字王右军, 游研社:]
1396                                           [只为等待your]
1655                                               [游研社:]
2075                                              [游民星空:]
2265                        [Mr_陈家铧_924_106, 神座出流_, 游侠网:]
2415                                       [半透明黑桶:, 寒天黑糖]
2985                                          [不吃生肉也不重名:]
2998                                         [error1980:]
3084                                             [独孤剑萧萧:]
Name: At_User, dtype: object

完成修改

Apply the revision

In [163]:
h_graph.loc[at1_Idx,'At_User'] = list(at1_set['At_User'].apply(process_at))
h_graph.loc[at1_Idx,'At_User']

562                      [我再也不想写代码了]
912                        [糖醋SAO排骨]
929                           [襟上花r]
1073                    [花盆栽柳树谁也拦不住]
1088           [原神, 微博抽奖平台, kevenhu]
1316                  [不爱写字王右军, 游研社]
1396                      [只为等待your]
1655                           [游研社]
2075                          [游民星空]
2265    [Mr_陈家铧_924_106, 神座出流_, 游侠网]
2415                   [半透明黑桶, 寒天黑糖]
2985                      [不吃生肉也不重名]
2998                     [error1980]
3084                         [独孤剑萧萧]
Name: At_User, dtype: object

idx=1088的Kevenhu后面去掉

Note (literal translation): for idx=1088, remove what follows "kevenhu"

#### (3) F_Idx=2274
冒号及后面的文本被错误的识别为用户名，手动修正即可

The colon (here a full-width "：") and the text after it were incorrectly captured as part of the username; fix this record manually

In [164]:
at_2274 = h_graph[h_graph['F_Idx']==2274]
at_2274

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,At_User
1797,2274,4380972860466037,2019-06-08 18:13:07+08:00,1914835762,饼玉玉,@饼玉玉：没玩过游戏的 去b站看看塞尔达旷野之息pv就明白了 这个pv都是抄的哦[太开心],,,,,NaT,,,,,[饼玉玉：没玩过游戏的]


In [165]:
h_graph.loc[1797, 'At_User'] = ['饼玉玉']
h_graph.loc[1797]

F_Idx                                                      2274
F_CommID                                       4380972860466037
F_Time                                2019-06-08 18:13:07+08:00
F_UserID                                             1914835762
F_UserName                                                  饼玉玉
F_Comment       @饼玉玉：没玩过游戏的  去b站看看塞尔达旷野之息pv就明白了  这个pv都是抄的哦[太开心]
T_Idx                                                       NaN
T_UserID                                                    NaN
T_UserName                                                  NaN
T_CommID                                                    NaN
T_Time                                                      NaT
T_Comment                                                   NaN
Floor_Idx                                                   NaN
Floor_CommID                                                NaN
Floor_UserID                                                NaN
At_User                                 

#### (4) F_Idx = 1465 (DataFrame idx = 1088)
再次确认，该文本里的链接 http://t.cn/AiCLxa7X@kevenhu 实际上由网页链接http://t.cn/AiCLxa7X 和对朋友的@“@kevenhu”组成<br>
查看At_User可知，用户名“kevenhu”被成功的提取出来。因此该条记录在At_User上无需额外的操作（在当前步骤里）

To double-check:<br>
The link (http://t.cn/AiCLxa7X@kevenhu) included in the text actually consists of 2 parts:
- link: http://t.cn/AiCLxa7X
- mentioned username: @kevenhu<br>
The username "kevenhu" was successfully extracted and stored in At_User. Therefore, this record needs no additional processing on At_User (at this step)

In [166]:
h_graph.loc[1088,'F_Comment']

'#原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，因为你将登上「神」之座。  #转发抽奖#关注@原神 并转发@ 一位好友，我们将通过@微博抽奖平台 送出以下奖品：         ※一等奖： PS4 Pro （2名）         ※二等奖： http://t.cn/AiCLxa7X@kevenhu   中中中'

In [167]:
h_graph.loc[1088]

F_Idx                                                        1465
F_CommID                                         4382003786816490
F_Time                                  2019-06-11 14:29:39+08:00
F_UserID                                               2796207394
F_UserName                                               不抽烟CHIRS
F_Comment       #原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，...
T_Idx                                                         NaN
T_UserID                                                      NaN
T_UserName                                                    NaN
T_CommID                                                      NaN
T_Time                                                        NaT
T_Comment                                                     NaN
Floor_Idx                                                     NaN
Floor_CommID                                                  NaN
Floor_UserID                                                  NaN
At_User   

In [168]:
h_graph[h_graph['At_User'].notna()]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,At_User
7,7,4842368296289558,2022-12-02 23:13:51+08:00,7752939212,花月歌浮舟,考古结束@Akoi,,,,,NaT,,,,,[Akoi]
30,45,4481688375200938,2020-03-12 16:20:37+08:00,6153913596,原神忠粉,作为米哈游科技（上海）有限公司制作发行的一款开放世界冒险游戏，画面风格精致富有美感，是一款自...,,,,,NaT,,,,,"[原神, 微博抽奖平台]"
32,47,4461401621863634,2020-01-16 16:48:17+08:00,5648392127,DodLIke刂兆,等私信@_恭弥大人_,,,,,NaT,,,,,[_恭弥大人_]
33,48,4461395208773600,2020-01-16 16:22:48+08:00,5210165672,wangxorz,@b1gcatttttttttt 来个测试资格吧[二哈],,,,,NaT,,,,,[b1gcatttttttttt]
35,53,4460756835567135,2020-01-14 22:06:08+08:00,6555427619,葱花不开花,@相生栗子 嘿嘿，栗子大大出来装个好友帮个忙[太开心][太开心],,,,,NaT,,,,,[相生栗子]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3159,2772,4380924667828394,2019-06-08 15:01:37+08:00,5554932231,A大调的华尔兹,很期待新作@Rusi_,,,,,NaT,,,,,[Rusi_]
3200,2933,4380905709610665,2019-06-08 13:46:17+08:00,2688011343,yukin725,@Rioshady 塞尔达旷野之息…而且估计操作感差很多[允悲][允悲],,,,,NaT,,,,,[Rioshady]
3558,3419,4380893496035358,2019-06-08 12:57:45+08:00,1776100951,ReIKiTsuNeNeGi,@云玩家FlameBeam 手机玩家的钱真好赚？,,,,,NaT,,,,,[云玩家FlameBeam]
3576,3545,4381283213536790,2019-06-09 14:46:20+08:00,6301774190,倾城恋怀,@任天堂香港有限公司,3506.0,6500909470.0,oO土豆泥O,4380885732287240.0,2019-06-08 12:26:53+08:00,这……有些地方一摸一样吧,3506.0,4380885732287240.0,6500909470.0,[任天堂香港有限公司]


### 3.3.3 Extract the usernames in each At_User list; store each username in its own column

In [169]:
at_notna = h_graph[h_graph['At_User'].notna()]

In [170]:
at_notna_idx = list((h_graph[h_graph['At_User'].notna()]).index)
at_notna_idx

[7,
 30,
 32,
 33,
 35,
 47,
 64,
 66,
 72,
 74,
 88,
 91,
 97,
 102,
 108,
 144,
 150,
 163,
 193,
 221,
 222,
 230,
 235,
 246,
 256,
 257,
 266,
 268,
 303,
 324,
 334,
 348,
 372,
 393,
 404,
 412,
 420,
 428,
 433,
 434,
 443,
 446,
 450,
 493,
 494,
 496,
 498,
 504,
 505,
 516,
 562,
 564,
 569,
 576,
 598,
 609,
 610,
 611,
 623,
 626,
 636,
 642,
 692,
 694,
 708,
 710,
 713,
 716,
 717,
 718,
 720,
 725,
 726,
 730,
 731,
 737,
 745,
 753,
 754,
 760,
 767,
 769,
 771,
 774,
 787,
 789,
 797,
 801,
 805,
 806,
 815,
 817,
 820,
 821,
 822,
 824,
 826,
 831,
 833,
 834,
 836,
 837,
 844,
 845,
 846,
 848,
 849,
 850,
 852,
 853,
 854,
 856,
 857,
 858,
 859,
 860,
 861,
 862,
 863,
 864,
 867,
 868,
 869,
 870,
 871,
 872,
 873,
 874,
 875,
 876,
 877,
 878,
 880,
 881,
 882,
 883,
 885,
 886,
 887,
 888,
 890,
 891,
 893,
 894,
 895,
 896,
 900,
 901,
 903,
 904,
 905,
 906,
 907,
 909,
 911,
 912,
 913,
 914,
 916,
 919,
 920,
 921,
 922,
 923,
 924,
 925,
 926,
 927,
 928,


In [171]:
at_notnaS = at_notna['At_User'].apply(pd.Series)
at_notnaS

Unnamed: 0,0,1,2,3,4,5,6
7,Akoi,,,,,,
30,原神,微博抽奖平台,,,,,
32,_恭弥大人_,,,,,,
33,b1gcatttttttttt,,,,,,
35,相生栗子,,,,,,
...,...,...,...,...,...,...,...
3159,Rusi_,,,,,,
3200,Rioshady,,,,,,
3558,云玩家FlameBeam,,,,,,
3576,任天堂香港有限公司,,,,,,


In [172]:
h_graph = pd.concat([h_graph, at_notnaS], axis=1)
h_graph

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,0,1,2,3,4,5,6
0,0,4915861950828535,2023-06-23 18:31:24+08:00,7752387333,祎只狸猫,考古,,,,,...,,,,,,,,,,
1,1,4910942133158267,2023-06-10 04:41:48+08:00,7604058796,邮一棵草莓i,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,,,,,...,,,,,,,,,,
2,2,4862811883439433,2023-01-28 09:09:22+08:00,7577283965,kmimg7,真不知道迅哥给评论区投了多少米？[允悲][doge],,,,,...,,,,,,,,,,
3,3,4862705678417948,2023-01-28 02:07:20+08:00,7476902376,小祀弟弟吖,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],,,,,...,,,,,,,,,,
4,4,4858911816944130,2023-01-17 14:51:54+08:00,7724649649,不知道如何评价,[吃瓜]现在是2023年 回来考古的点个赞,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3845,3837,4380988346320255,2019-06-08 19:14:39+08:00,5646067604,Aki是乌龟,你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就不出续作了，我给您磕头了，...,,,怕事先改名shaw,,...,4380879889748007.0,6201564748.0,,,,,,,,
3846,3838,4380987008064553,2019-06-08 19:09:20+08:00,6201564748,兔纸今天能摸到鱼吗,🐮🍺，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...,,,沉迷艾欧泽亚的菜菜啊,,...,4380879889748007.0,6201564748.0,,,,,,,,
3847,3840,4380980682966825,2019-06-08 18:44:12+08:00,6201564748,兔纸今天能摸到鱼吗,到我这条底下阴阳怪气啥？觉得抄袭了去评论前面几条啊，去和米卫兵吵啊，我又没发什么过激言论。再...,,,沙奈朵的裙底到底有什么,,...,4380879889748007.0,6201564748.0,,,,,,,,
3848,216,4415493412865414,2019-09-12 00:25:27+08:00,6530036372,淐馮,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,215.0,6940897092.0,元首的胖次00658,4404535403395100.0,...,4404535403395100.0,6940897092.0,,,,,,,,


Rename the @-related columns to make the dataset more readable

In [173]:
at_colName = {
    0: 'At_U1',
    1: 'At_U2',
    2: 'At_U3',
    3: 'At_U4',
    4: 'At_U5',
    5: 'At_U6',
    6: 'At_U7'}

h_graph.rename(columns = at_colName, inplace=True)
h_graph

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
0,0,4915861950828535,2023-06-23 18:31:24+08:00,7752387333,祎只狸猫,考古,,,,,...,,,,,,,,,,
1,1,4910942133158267,2023-06-10 04:41:48+08:00,7604058796,邮一棵草莓i,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,,,,,...,,,,,,,,,,
2,2,4862811883439433,2023-01-28 09:09:22+08:00,7577283965,kmimg7,真不知道迅哥给评论区投了多少米？[允悲][doge],,,,,...,,,,,,,,,,
3,3,4862705678417948,2023-01-28 02:07:20+08:00,7476902376,小祀弟弟吖,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],,,,,...,,,,,,,,,,
4,4,4858911816944130,2023-01-17 14:51:54+08:00,7724649649,不知道如何评价,[吃瓜]现在是2023年 回来考古的点个赞,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3845,3837,4380988346320255,2019-06-08 19:14:39+08:00,5646067604,Aki是乌龟,你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就不出续作了，我给您磕头了，...,,,怕事先改名shaw,,...,4380879889748007.0,6201564748.0,,,,,,,,
3846,3838,4380987008064553,2019-06-08 19:09:20+08:00,6201564748,兔纸今天能摸到鱼吗,🐮🍺，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...,,,沉迷艾欧泽亚的菜菜啊,,...,4380879889748007.0,6201564748.0,,,,,,,,
3847,3840,4380980682966825,2019-06-08 18:44:12+08:00,6201564748,兔纸今天能摸到鱼吗,到我这条底下阴阳怪气啥？觉得抄袭了去评论前面几条啊，去和米卫兵吵啊，我又没发什么过激言论。再...,,,沙奈朵的裙底到底有什么,,...,4380879889748007.0,6201564748.0,,,,,,,,
3848,216,4415493412865414,2019-09-12 00:25:27+08:00,6530036372,淐馮,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,215.0,6940897092.0,元首的胖次00658,4404535403395100.0,...,4404535403395100.0,6940897092.0,,,,,,,,
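
The expansion and renaming steps above can also be done in one pass. The sketch below is one possible variant, not the notebook's own code: it builds the new columns from the lists with `pd.DataFrame(...)` and names them `At_U1`, `At_U2`, ... directly. The toy frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for h_graph; only the At_User column matters here
demo = pd.DataFrame({
    'F_Comment': ['a', 'b', 'c'],
    'At_User': [['Akoi'], None, ['原神', '微博抽奖平台']],
})

# Keep only the rows that actually mention someone
mentions = demo.loc[demo['At_User'].notna(), 'At_User']

# One column per list position; shorter lists are padded with missing values
expanded = pd.DataFrame(mentions.tolist(), index=mentions.index)
expanded.columns = [f'At_U{i + 1}' for i in range(expanded.shape[1])]

# Align on the index; rows without mentions get NaN in every At_U column
demo = demo.join(expanded)
```

In this variant the number of `At_U` columns follows the longest list, so no hand-written rename mapping is needed.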


## 3.4 Subsequent processing 2: remove "@username" and handle retweets

The dataset contains retweets of the form "...text 1...//@username: ...text 2...", which need to be checked and handled carefully.<br>
For texts that contain "@username", we remove the "@username" and keep the rest (e.g., "...text 1... @username" -> "...text 1...").

### 3.4.1 Handle retweets
`at1_set` and `at1_Idx` come from Section 3.3.2 (2)

In [174]:
at1_set

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,T_Time,T_Comment,Floor_Idx,Floor_CommID,Floor_UserID,At_User
562,860,4385957564129314,2019-06-22 12:20:33+08:00,3914447813,老老老老老那啊,你们是真不要脸了，哪怕改一改也行啊，复制粘贴不太好吧弟弟//@我再也不想写代码了 :你们真的...,,,,,NaT,,,,,[我再也不想写代码了]
912,1239,4383085795509573,2019-06-14 14:09:10+08:00,6449314474,十香Princess2018,//@糖醋SAO排骨:,,,,,NaT,,,,,[糖醋SAO排骨:]
929,1260,4383059581044564,2019-06-14 12:25:00+08:00,5264335436,忧鱼与熊掌,@襟上花r 琪亚娜 http://t.cn/AiNhbK5k,,,,,NaT,,,,,[襟上花r]
1073,1418,4382090839463454,2019-06-11 20:15:33+08:00,2482877025,昊哥昊哥耗,真实//@花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然...,,,,,NaT,,,,,[花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然而事实是...
1088,1465,4382003786816490,2019-06-11 14:29:39+08:00,2796207394,不抽烟CHIRS,#原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，...,,,,,NaT,,,,,"[原神, 微博抽奖平台, kevenhu]"
1316,1702,4381267762552372,2019-06-09 13:44:56+08:00,1740544170,不爱发博王左军,@不爱写字王右军 //@游研社: 米哈游新作《原神》放出了新预告,,,,,NaT,,,,,"[不爱写字王右军, 游研社:]"
1396,1786,4381190587010404,2019-06-09 08:38:16+08:00,1093773890,Luciferbear,@只为等待your PS4 PRO http://t.cn/AiC5X41g,,,,,NaT,,,,,[只为等待your]
1655,2109,4380996277150017,2019-06-08 19:46:10+08:00,3218789155,天雷牙皇,//@游研社: 米哈游新作《原神》放出了新预告,,,,,NaT,,,,,[游研社:]
2075,2779,4380924198016900,2019-06-08 14:59:45+08:00,1765337557,DDDDDlno,边抄边加了很多新的，光看pv还是很舒服的。希望能讲好自己的故事。//@游民星空: 米哈游新作...,,,,,NaT,,,,,[游民星空:]
2265,3149,4380897346334849,2019-06-08 13:13:03+08:00,5846473488,幸运的血玫瑰男爵,@Mr_陈家铧_924_106 @神座出流_ //@游侠网: 来了来了[并不简单][并不简单],,,,,NaT,,,,,"[Mr_陈家铧_924_106, 神座出流_, 游侠网:]"


And their DataFrame indices (also obtained in Section 3.3.2 (2))

In [175]:
at1_Idx

[562,
 912,
 929,
 1073,
 1088,
 1316,
 1396,
 1655,
 2075,
 2265,
 2415,
 2985,
 2998,
 3084]

In [176]:
h_graph[h_graph['F_Comment'].str.contains('//@')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
562,860,4385957564129314,2019-06-22 12:20:33+08:00,3914447813,老老老老老那啊,你们是真不要脸了，哪怕改一改也行啊，复制粘贴不太好吧弟弟//@我再也不想写代码了 :你们真的...,,,,,...,,,[我再也不想写代码了],我再也不想写代码了,,,,,,
912,1239,4383085795509573,2019-06-14 14:09:10+08:00,6449314474,十香Princess2018,//@糖醋SAO排骨:,,,,,...,,,[糖醋SAO排骨],糖醋SAO排骨,,,,,,
1073,1418,4382090839463454,2019-06-11 20:15:33+08:00,2482877025,昊哥昊哥耗,真实//@花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然...,,,,,...,,,[花盆栽柳树谁也拦不住],花盆栽柳树谁也拦不住,,,,,,
1316,1702,4381267762552372,2019-06-09 13:44:56+08:00,1740544170,不爱发博王左军,@不爱写字王右军 //@游研社: 米哈游新作《原神》放出了新预告,,,,,...,,,"[不爱写字王右军, 游研社]",不爱写字王右军,游研社,,,,,
1655,2109,4380996277150017,2019-06-08 19:46:10+08:00,3218789155,天雷牙皇,//@游研社: 米哈游新作《原神》放出了新预告,,,,,...,,,[游研社],游研社,,,,,,
2075,2779,4380924198016900,2019-06-08 14:59:45+08:00,1765337557,DDDDDlno,边抄边加了很多新的，光看pv还是很舒服的。希望能讲好自己的故事。//@游民星空: 米哈游新作...,,,,,...,,,[游民星空],游民星空,,,,,,
2265,3149,4380897346334849,2019-06-08 13:13:03+08:00,5846473488,幸运的血玫瑰男爵,@Mr_陈家铧_924_106 @神座出流_ //@游侠网: 来了来了[并不简单][并不简单],,,,,...,,,"[Mr_陈家铧_924_106, 神座出流_, 游侠网]",Mr_陈家铧_924_106,神座出流_,游侠网,,,,
2415,3779,4380883211537085,2019-06-08 12:16:52+08:00,1879393202,寒天黑糖,好可爱的小男孩！！！！wsl//@半透明黑桶: 看！是PV！[泪]@寒天黑糖,,,,,...,,,"[半透明黑桶, 寒天黑糖]",半透明黑桶,寒天黑糖,,,,,


On inspection, "游研社", "游民星空" and "游侠网" are all verified, influential accounts on Weibo (so-called "big V" accounts, mostly self-media). The DataFrame indices of the records whose F_Comment retweets these accounts are listed below:
- 游研社: idx 1316, 1655
- 游民星空: idx 2075
- 游侠网: idx 2265
<br>None of these 3 accounts appears in our user database `user_data`, as the next three cells show

In [177]:
user_data[user_data['UserName']=='游研社']

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank


In [178]:
user_data[user_data['UserName']=='游民星空']

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank


In [179]:
user_data[user_data['UserName']=='游侠网']

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
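
The three lookups above can be combined into a single membership check. This is a minimal sketch, assuming only that the user database has a `UserName` column as in the outputs above; `user_data_demo` and `media_accounts` are hypothetical stand-ins, not objects from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for user_data with just the column we need
user_data_demo = pd.DataFrame({'UserName': ['祎只狸猫', 'kmimg7', '淐馮']})

media_accounts = ['游研社', '游民星空', '游侠网']

# Rows of the user database whose name is one of the accounts
found = user_data_demo.loc[user_data_demo['UserName'].isin(media_accounts), 'UserName']

# Accounts with no row in the user database
missing = set(media_accounts) - set(found)
print(missing)
```

An empty `found` together with all three names in `missing` corresponds to the three empty results shown above.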


Checking the target post on Weibo also shows that these 3 accounts only retweeted the post and left no comments in the comment section. As influential self-media accounts, their role is to spread information and promote the target post by retweeting; they did not take part in any game-related discussion.
<br>Therefore, these 3 accounts should not appear in the user interaction network: we need to remove them from At_User.

The remaining retweets are clearly retweets of other ordinary users' retweets (not self-media), and all of them discuss game-related content.
<br>Take the original retweeter "我再也不想写代码了" in the example below: his/her original text "你们真的不怕任天堂告吗……有些地方完全一模一样[疑问]" is a retweet of the target post rather than a comment, which is why the user database `user_data` has no information about this user.
<br>The text of such a user (e.g., "我再也不想写代码了") can be treated as a new record and saved in our dataset, with that user as the From user.

In [180]:
h_graph.loc[562,'F_Comment']

'你们是真不要脸了，哪怕改一改也行啊，复制粘贴不太好吧弟弟//@我再也不想写代码了 :你们真的不怕任天堂告吗……有些地方完全一模一样[疑问]'

In [181]:
user_data[user_data['UserName']=='我再也不想写代码了']

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank


In [182]:
h_graph.loc[1073,'F_Comment']

'真实//@花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然而事实是，国内很大一部分玩家确实是傻子[可爱]'

In [183]:
user_data[user_data['UserName']=='花盆栽柳树谁也拦不住']

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank


In [184]:
h_graph.loc[2415,'F_Comment']

'好可爱的小男孩！！！！wsl//@半透明黑桶: 看！是PV！[泪]@寒天黑糖'

In [185]:
user_data[user_data['UserName']=='半透明黑桶']

Unnamed: 0,UserID,FirstIndex,IndexList,UserName,TotalComment,Comment,Reply,LikeCount,Province,ProvinceCode,...,DescriClean,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank


Now handle these 2 types of retweets

#### (1) Remove retweets of self-media posts
Operation
- Identify the records by DataFrame index: 1316, 1655, 2075, 2265
- Use `split` and keep the text before "//@"

In [186]:
for i in [1316,1655,2075,2265]:
    h_graph.loc[i,'F_Comment'] = h_graph.loc[i,'F_Comment'].split('//@')[0]
h_graph.loc[[1316,1655,2075,2265],'F_Comment']

1316                           @不爱写字王右军 
1655                                    
2075    边抄边加了很多新的，光看pv还是很舒服的。希望能讲好自己的故事。
2265            @Mr_陈家铧_924_106  @神座出流_ 
Name: F_Comment, dtype: object

In [187]:
h_graph.loc[[1316,1655,2075,2265]]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1316,1702,4381267762552372,2019-06-09 13:44:56+08:00,1740544170,不爱发博王左军,@不爱写字王右军,,,,,...,,,"[不爱写字王右军, 游研社]",不爱写字王右军,游研社,,,,,
1655,2109,4380996277150017,2019-06-08 19:46:10+08:00,3218789155,天雷牙皇,,,,,,...,,,[游研社],游研社,,,,,,
2075,2779,4380924198016900,2019-06-08 14:59:45+08:00,1765337557,DDDDDlno,边抄边加了很多新的，光看pv还是很舒服的。希望能讲好自己的故事。,,,,,...,,,[游民星空],游民星空,,,,,,
2265,3149,4380897346334849,2019-06-08 13:13:03+08:00,5846473488,幸运的血玫瑰男爵,@Mr_陈家铧_924_106 @神座出流_,,,,,...,,,"[Mr_陈家铧_924_106, 神座出流_, 游侠网]",Mr_陈家铧_924_106,神座出流_,游侠网,,,,


Manually clear the self-media accounts from the At_U columns

In [188]:
h_graph.loc[1316, 'At_U2'] = np.nan
h_graph.loc[[1655,2075], 'At_U1'] = np.nan
h_graph.loc[2265, 'At_U3'] = np.nan
h_graph.loc[[1316,1655,2075,2265]]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1316,1702,4381267762552372,2019-06-09 13:44:56+08:00,1740544170,不爱发博王左军,@不爱写字王右军,,,,,...,,,"[不爱写字王右军, 游研社]",不爱写字王右军,,,,,,
1655,2109,4380996277150017,2019-06-08 19:46:10+08:00,3218789155,天雷牙皇,,,,,,...,,,[游研社],,,,,,,
2075,2779,4380924198016900,2019-06-08 14:59:45+08:00,1765337557,DDDDDlno,边抄边加了很多新的，光看pv还是很舒服的。希望能讲好自己的故事。,,,,,...,,,[游民星空],,,,,,,
2265,3149,4380897346334849,2019-06-08 13:13:03+08:00,5846473488,幸运的血玫瑰男爵,@Mr_陈家铧_924_106 @神座出流_,,,,,...,,,"[Mr_陈家铧_924_106, 神座出流_, 游侠网]",Mr_陈家铧_924_106,神座出流_,,,,,


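Deal with F_Comment: records 1316 and 2265 now contain nothing but "@username" mentions, so their F_Comment is set to an empty string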

In [189]:
h_graph.loc[[1316,2265],'F_Comment'] = ''
h_graph.loc[[1316,1655,2075,2265]]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1316,1702,4381267762552372,2019-06-09 13:44:56+08:00,1740544170,不爱发博王左军,,,,,,...,,,"[不爱写字王右军, 游研社]",不爱写字王右军,,,,,,
1655,2109,4380996277150017,2019-06-08 19:46:10+08:00,3218789155,天雷牙皇,,,,,,...,,,[游研社],,,,,,,
2075,2779,4380924198016900,2019-06-08 14:59:45+08:00,1765337557,DDDDDlno,边抄边加了很多新的，光看pv还是很舒服的。希望能讲好自己的故事。,,,,,...,,,[游民星空],,,,,,,
2265,3149,4380897346334849,2019-06-08 13:13:03+08:00,5846473488,幸运的血玫瑰男爵,,,,,,...,,,"[Mr_陈家铧_924_106, 神座出流_, 游侠网]",Mr_陈家铧_924_106,神座出流_,,,,,
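
The manual steps above (cut the text at "//@", then drop the self-media account from the mentions) can be sketched as one small function. This is an illustrative variant under stated assumptions: `strip_media_retweet` and `MEDIA_ACCOUNTS` are hypothetical names, and the function works on a plain string and list rather than on `h_graph` itself.

```python
# Accounts identified above as self-media; used here only for illustration
MEDIA_ACCOUNTS = {'游研社', '游民星空', '游侠网'}

def strip_media_retweet(text, mentions):
    """Return (text before the first '//@', mentions without self-media accounts)."""
    head = text.split('//@', 1)[0]
    kept = [name for name in mentions if name not in MEDIA_ACCOUNTS]
    return head, kept

head, kept = strip_media_retweet(
    '@不爱写字王右军 //@游研社: 米哈游新作《原神》放出了新预告',
    ['不爱写字王右军', '游研社'],
)
print(repr(head), kept)
```

The returned `head` still contains the "@不爱写字王右军" mention; removing it is a separate step, handled in Section 3.4.2.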


#### (2) Handle the other type of retweets

Only 4 records are of this type

In [190]:
h_graph[h_graph['F_Comment'].str.contains('//@')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
562,860,4385957564129314,2019-06-22 12:20:33+08:00,3914447813,老老老老老那啊,你们是真不要脸了，哪怕改一改也行啊，复制粘贴不太好吧弟弟//@我再也不想写代码了 :你们真的...,,,,,...,,,[我再也不想写代码了],我再也不想写代码了,,,,,,
912,1239,4383085795509573,2019-06-14 14:09:10+08:00,6449314474,十香Princess2018,//@糖醋SAO排骨:,,,,,...,,,[糖醋SAO排骨],糖醋SAO排骨,,,,,,
1073,1418,4382090839463454,2019-06-11 20:15:33+08:00,2482877025,昊哥昊哥耗,真实//@花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然...,,,,,...,,,[花盆栽柳树谁也拦不住],花盆栽柳树谁也拦不住,,,,,,
2415,3779,4380883211537085,2019-06-08 12:16:52+08:00,1879393202,寒天黑糖,好可爱的小男孩！！！！wsl//@半透明黑桶: 看！是PV！[泪]@寒天黑糖,,,,,...,,,"[半透明黑桶, 寒天黑糖]",半透明黑桶,寒天黑糖,,,,,


Use F_Idx as the reference to handle the data
- 860 1239 1418 3779

In [191]:
for f_idx in [860, 1239, 1418, 3779]:
    # Locate the row by F_Idx, because row labels shift after each insertion below
    t_idx = h_graph.index[h_graph['F_Idx'] == f_idx][0]
    target_com = h_graph.loc[t_idx, 'F_Comment']
    # Split once at the retweet marker: the retweeter's own text, then "username:original text"
    f_comm, new_data = target_com.split('//@', 1)
    # Split once at the first colon; strip() covers both "name :text" and "name: text".
    # (The original test `if i != 3779` read the stale loop variable `i` from In [186].)
    new_user, new_comm = (part.strip() for part in new_data.split(':', 1))

    '''Handle the original record first'''
    h_graph.loc[t_idx,'F_Comment'] = f_comm
    h_graph.loc[t_idx,'T_UserName'] = new_user
    '''Then insert the retweeted text as a new record right below it'''
    insert_idx = t_idx + 0.5
    h_graph.loc[insert_idx, ['F_UserName','F_Comment']] = new_user,new_comm
    h_graph = h_graph.sort_index().reset_index(drop=True)
    

In [192]:
h_graph.loc[[562,563,913,914,1075,1076,2418,2419]]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
562,860.0,4385957564129313.5,2019-06-22 12:20:33+08:00,3914447813.0,老老老老老那啊,你们是真不要脸了，哪怕改一改也行啊，复制粘贴不太好吧弟弟,,,我再也不想写代码了,,...,,,[我再也不想写代码了],我再也不想写代码了,,,,,,
563,,,NaT,,我再也不想写代码了,你们真的不怕任天堂告吗……有些地方完全一模一样[疑问],,,,,...,,,,,,,,,,
913,1239.0,4383085795509573.0,2019-06-14 14:09:10+08:00,6449314474.0,十香Princess2018,,,,糖醋SAO排骨,,...,,,[糖醋SAO排骨],糖醋SAO排骨,,,,,,
914,,,NaT,,糖醋SAO排骨,,,,,,...,,,,,,,,,,
1075,1418.0,4382090839463454.5,2019-06-11 20:15:33+08:00,2482877025.0,昊哥昊哥耗,真实,,,花盆栽柳树谁也拦不住,,...,,,[花盆栽柳树谁也拦不住],花盆栽柳树谁也拦不住,,,,,,
1076,,,NaT,,花盆栽柳树谁也拦不住,国内厂商总以为玩家是傻子，东抄抄西抄抄就能出来骗钱[微笑]然而事实是，国内很大一部分玩家确实...,,,,,...,,,,,,,,,,
2418,3779.0,4380883211537085.0,2019-06-08 12:16:52+08:00,1879393202.0,寒天黑糖,好可爱的小男孩！！！！wsl,,,半透明黑桶,,...,,,"[半透明黑桶, 寒天黑糖]",半透明黑桶,寒天黑糖,,,,,
2419,,,NaT,,半透明黑桶,看！是PV！[泪]@寒天黑糖,,,,,...,,,,,,,,,,
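
The parsing step in the loop above can be isolated and checked on its own. The helper below is a minimal sketch, not the notebook's code: `split_retweet` is a hypothetical name, and it assumes a single "//@username:text" level with an ASCII colon, as in the 4 records handled here.

```python
def split_retweet(text):
    """Split 'own text//@user:original text' into (own, user, original).

    Returns (text, None, None) when there is no retweet marker.
    """
    if '//@' not in text:
        return text, None, None
    own, rest = text.split('//@', 1)
    # partition() splits at the first colon only, so colons inside the
    # original text are preserved
    user, _, original = rest.partition(':')
    return own, user.strip(), original.strip()

print(split_retweet('真实//@花盆栽柳树谁也拦不住:国内厂商总以为玩家是傻子'))
# → ('真实', '花盆栽柳树谁也拦不住', '国内厂商总以为玩家是傻子')
print(split_retweet('//@糖醋SAO排骨:'))
# → ('', '糖醋SAO排骨', '')
```

Stripping the username matters for record 562, whose text has a space before the colon ("//@我再也不想写代码了 :").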


In [193]:
len(h_graph)

3854
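
The row count grows from 3850 to 3854 because each of the 4 retweets adds one record, placed with a fractional label (`t_idx + 0.5`) that sorts between two existing rows. The sketch below reproduces that idea on a toy frame, using `pd.concat` instead of assignment through `.loc`; the frame `demo` and the name `new_user` are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'F_Idx': [860, 1239, 1418],
    'F_UserName': ['a', 'b', 'c'],
})

# A one-row frame labelled 0.5 sorts between the rows labelled 0 and 1
new_row = pd.DataFrame({'F_UserName': ['new_user']}, index=[0.5])

demo = pd.concat([demo, new_row]).sort_index().reset_index(drop=True)
print(demo)
print(demo['F_Idx'].dtype)  # integer column becomes float64 because of the NaN
```

The new row has no F_Idx, so the integer column is upcast to float. The same effect explains why F_Idx, F_CommID and F_UserID are displayed with a decimal point after the insertion above.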

Manually fix the records with DataFrame index 2418 and 2419: the mention "@寒天黑糖" belongs to the retweeted text, so it moves from record 2418 to the new record 2419

In [194]:
h_graph.loc[2418,'At_U2'] = np.nan
h_graph.loc[2419,['At_User','At_U1']] = ['寒天黑糖'],'寒天黑糖'
h_graph.loc[[2418,2419]]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
2418,3779.0,4380883211537085.0,2019-06-08 12:16:52+08:00,1879393202.0,寒天黑糖,好可爱的小男孩！！！！wsl,,,半透明黑桶,,...,,,"[半透明黑桶, 寒天黑糖]",半透明黑桶,,,,,,
2419,,,NaT,,半透明黑桶,看！是PV！[泪]@寒天黑糖,,,,,...,,,[寒天黑糖],寒天黑糖,,,,,,


### 3.4.2 Remove "@username"

The retweet pattern "//@username" was completely removed in Section 3.4.1, as the empty result below confirms

In [195]:
h_graph[h_graph['F_Comment'].str.contains('//@')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7


Therefore we can process the texts in bulk: remove every "@username" pattern from F_Comment

In [196]:
hetero_data = h_graph.copy()
hetero_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   float64                  
 1   F_CommID      3850 non-null   float64                  
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   float64                  
 4   F_UserName    3854 non-null   object                   
 5   F_Comment     3854 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      907 non-null    float64                  
 8   T_UserName    1191 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   floa

Texts that contain a website link (e.g., http) will be handled in Section 3.4.3

In [197]:
hetero_data[(hetero_data['F_Comment'].str.contains('http')) & (hetero_data['F_Comment'].str.contains('@'))]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
931,1260.0,4383059581044564.0,2019-06-14 12:25:00+08:00,5264335436.0,忧鱼与熊掌,@襟上花r 琪亚娜 http://t.cn/AiNhbK5k,,,,,...,,,[襟上花r],襟上花r,,,,,,
1091,1465.0,4382003786816489.5,2019-06-11 14:29:39+08:00,2796207394.0,不抽烟CHIRS,#原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，...,,,,,...,,,"[原神, 微博抽奖平台, kevenhu]",原神,微博抽奖平台,kevenhu,,,,
1399,1786.0,4381190587010404.0,2019-06-09 08:38:16+08:00,1093773890.0,Luciferbear,@只为等待your PS4 PRO http://t.cn/AiC5X41g,,,,,...,,,[只为等待your],只为等待your,,,,,,


In [198]:
comment_data[(comment_data['CommentRaw'].str.contains('http'))]

Unnamed: 0,CommentID,CommentTime,RootID,CommentRaw,Comment,CommentLike,CommentReply,UserID,UserName,Province,...,Region,UserLocation,UserDescription,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
180,4408376185859132,2019-08-23 09:04:09+08:00,0,各位版权卫士在吗？http://t.cn/AiTZjEoV 去吧，记得告诉我结果。提醒一下，...,各位版权卫士在吗？网页链接 去吧，记得告诉我结果。提醒一下，举报是要实名制的，如果有虚报，是...,5,0.0,5516757915,四方羊尊,福建,...,泉州,福建 泉州,,1,0,34,0,0,0,0
231,4403321353740036,2019-08-09 10:18:03+08:00,0,各位版权卫士在吗？http://t.cn/AiTZjEoV 去吧，记得告诉我结果。提醒一下，...,各位版权卫士在吗？网页链接 去吧，记得告诉我结果。提醒一下，举报是要实名制的，如果有虚报，是...,0,1.0,6604518742,创世界AP57017,吉林,...,吉林,吉林 吉林,东北电力大学18级学生,1,4,93,17,0,0,0
258,4402777024950108,2019-08-07 22:15:05+08:00,0,各位版权卫士在吗？http://t.cn/AiTZjEoV 去吧，记得告诉我结果。提醒一下，...,各位版权卫士在吗？网页链接 去吧，记得告诉我结果。提醒一下，举报是要实名制的，如果有虚报，是...,0,0.0,6935465869,迷眼鳴,广东,...,,广东,,1,6,169,67,0,0,0
392,4401297660466221,2019-08-03 20:16:37+08:00,0,图片评论 http://t.cn/AiYOOAVr,图片评论,6,7.0,3215896104,巧克力的铲s官,其他,...,,其他,,1,104,386,2289,0,0,0
403,4401222196411567,2019-08-03 15:16:44+08:00,0,图片评论 http://t.cn/AiYpIotC,图片评论,2,0.0,3666589980,惊讶ghia,其他,...,,其他,纯洁善良且富有正义感的外衣。,1,39,110,12,0,0,0
406,4401164697167415,2019-08-03 11:28:16+08:00,0,图片评论 http://t.cn/AiYCjLyc,图片评论,3,0.0,1887405905,白圭君,浙江,...,杭州,浙江 杭州,自留地,1,588,886,5994,0,0,0
414,4401475629270288,2019-08-04 08:03:48+08:00,4401129423178684,回复@Rmbaci:？？？屁还是您老会放啊 http://t.cn/AiYjKUTh,回复@Rmbaci:？？？屁还是您老会放啊 查看图片,0,,2874387804,天不再晴朗Lk,江苏,...,常州,江苏 常州,,0,53,473,259,0,0,0
479,4399351432156798,2019-07-29 11:23:00+08:00,0,http://t.cn/AijE84Wh,网页链接,0,1.0,6151921836,兔兔谈谈,广西,...,,广西,一个叫山东烟台的看到一个两字高潮，让我笑死了,1,3,42,27,0,0,0
604,4387335388619657,2019-06-26 07:35:31+08:00,0,图片评论 http://t.cn/AipRbYB1,图片评论,1,0.0,5792044643,建议微博尽快倒闭,江西,...,,江西,,1,22,108,20,0,0,0
618,4387161064938438,2019-06-25 20:02:49+08:00,0,如果这个游戏活下来并且火了，那么以后中国的游戏就真的没救了。 http://t.cn/Aip...,如果这个游戏活下来并且火了，那么以后中国的游戏就真的没救了。,2,0.0,3991684702,travis-B,其他,...,,其他,,1,5,136,52,0,0,7


#### (1) Remove "@username"

In [199]:
def remove_un(text):
    # Replace each "@username" (an "@" followed by a run of characters that are
    # neither whitespace nor "@") with a single space
    cleaned_text = re.sub(r'@([^\s@]+)',' ', text)
    return cleaned_text

In [200]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].apply(remove_un)

In [201]:
hetero_data[hetero_data['F_Comment'].str.contains('@')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1091,1465.0,4382003786816489.5,2019-06-11 14:29:39+08:00,2796207394.0,不抽烟CHIRS,#原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，...,,,,,...,,,"[原神, 微博抽奖平台, kevenhu]",原神,微博抽奖平台,kevenhu,,,,
1178,1554.0,4381667529532503.0,2019-06-10 16:13:29+08:00,5504761484.0,心镜D,@,,,,,...,,,[一碗肉很多的皮蛋瘦肉粥],一碗肉很多的皮蛋瘦肉粥,,,,,,


The record with DataFrame index 1178 was mishandled because its text contains 2 consecutive "@" characters. We will fix this in the next part, (2)

In [202]:
h_graph.loc[1178]

F_Idx                              1554.0
F_CommID               4381667529532503.0
F_Time          2019-06-10 16:13:29+08:00
F_UserID                     5504761484.0
F_UserName                            心镜D
F_Comment                   @@一碗肉很多的皮蛋瘦肉粥
T_Idx                                 NaN
T_UserID                              NaN
T_UserName                            NaN
T_CommID                              NaN
T_Time                                NaT
T_Comment                             NaN
Floor_Idx                             NaN
Floor_CommID                          NaN
Floor_UserID                          NaN
At_User                     [一碗肉很多的皮蛋瘦肉粥]
At_U1                         一碗肉很多的皮蛋瘦肉粥
At_U2                                 NaN
At_U3                                 NaN
At_U4                                 NaN
At_U5                                 NaN
At_U6                                 NaN
At_U7                                 NaN
Name: 1178, dtype: object
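
To see why these records are mishandled, it helps to run the pattern from `remove_un` on a few strings taken from the outputs in this section. The sketch below reproduces the function from In [199] so that it runs on its own; the variable names `plain`, `double_at` and `no_space` are illustrative.

```python
import re

def remove_un(text):
    # Same pattern as In [199]: "@" plus a run of non-space, non-"@" characters
    return re.sub(r'@([^\s@]+)', ' ', text)

# A normal mention is replaced by one space
plain = remove_un('考古结束@Akoi')

# "@@name": the first "@" has no valid character after it, so only the
# second "@" and the name are removed and a lone "@" survives
double_at = remove_un('@@一碗肉很多的皮蛋瘦肉粥')

# No whitespace after "@官方": the match runs to the end of the text,
# so the user's own words are removed as well
no_space = remove_un('@原神 @官方，我真是小机灵鬼[笑而不语]')

print(repr(plain), repr(double_at), repr(no_space))
```

The second and third results correspond to records 1178 and 816, which is why both need a manual fix.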

#### (2) Fix the mishandled special cases

The first special case, shown below, was discussed in Section 3.3.2 (1)

In [203]:
h_graph[h_graph['F_Idx']==1139]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
816,1139.0,4384309676961753.0,2019-06-17 23:12:25+08:00,2208067411.0,GanChongren,@原神 @官方，我真是小机灵鬼[笑而不语],,,,,...,,,[原神],原神,,,,,,


In [204]:
hetero_data[hetero_data['F_Idx']==1139]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
816,1139.0,4384309676961753.0,2019-06-17 23:12:25+08:00,2208067411.0,GanChongren,,,,,,...,,,[原神],原神,,,,,,


In [205]:
hetero_data.loc[816, 'F_Comment'] = '@官方，我真是小机灵鬼[笑而不语]'
hetero_data[hetero_data['F_Idx']==1139]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
816,1139.0,4384309676961753.0,2019-06-17 23:12:25+08:00,2208067411.0,GanChongren,@官方，我真是小机灵鬼[笑而不语],,,,,...,,,[原神],原神,,,,,,


The second special case is the record with DataFrame index 1091. The part "#原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，因为你将登上「神」之座。  #转发抽奖#关注  并转发@ 一位好友，我们将通过  送出以下奖品：         ※一等奖： PS4 Pro （2名）         ※二等奖： http://t.cn/AiCLxa7X" of its text is the lottery announcement posted by the Weibo lottery platform (微博抽奖平台), so removing it is appropriate
<br>Note that "中中中" is the user's own comment and should be kept

In [206]:
hetero_data.loc[1091, 'F_Comment']

'#原神# ▶序章PV：捕风的异乡人◀ 维系者正在死去，创造者尚未到来。 但世界不会再度灼烧，因为你将登上「神」之座。  #转发抽奖#关注  并转发@ 一位好友，我们将通过  送出以下奖品：         ※一等奖： PS4 Pro （2名）         ※二等奖： http://t.cn/AiCLxa7X    中中中'

In [207]:
hetero_data.loc[1091, 'F_Comment'] = '中中中'
hetero_data.loc[1091, 'F_Comment']

'中中中'

The last one is the record with DataFrame index 1178, which was discussed in Section 3.4.2 (1) above. Since this user only mentioned his/her friend "一碗肉很多的皮蛋瘦肉粥" and left no text, we set this record's F_Comment to an empty string

In [208]:
hetero_data.loc[1178, 'F_Comment']

'@ '

In [209]:
hetero_data.loc[1178, 'F_Comment'] = ''
hetero_data.loc[1178, 'F_Comment']

''

Confirm that all "@username" patterns were removed. The only remaining match is record 816, whose "@官方" was restored on purpose in the first special case above

In [210]:
hetero_data[(hetero_data['At_User'].notna()) & (hetero_data['F_Comment'].str.contains('@'))]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
816,1139.0,4384309676961753.0,2019-06-17 23:12:25+08:00,2208067411.0,GanChongren,@官方，我真是小机灵鬼[笑而不语],,,,,...,,,[原神],原神,,,,,,


### 3.4.3 Remove website links (e.g., http)

In [211]:
hetero_data[hetero_data['F_Comment'].str.contains('http')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
110,180.0,4408376185859132.0,2019-08-23 09:04:09+08:00,5516757915.0,四方羊尊,各位版权卫士在吗？http://t.cn/AiTZjEoV 去吧，记得告诉我结果。提醒一下，...,,,,,...,,,,,,,,,,
149,258.0,4402777024950108.0,2019-08-07 22:15:05+08:00,6935465869.0,迷眼鳴,各位版权卫士在吗？http://t.cn/AiTZjEoV 去吧，记得告诉我结果。提醒一下，...,,,,,...,,,,,,,,,,
202,403.0,4401222196411567.0,2019-08-03 15:16:44+08:00,3666589980.0,惊讶ghia,图片评论 http://t.cn/AiYpIotC,,,,,...,,,,,,,,,,
205,406.0,4401164697167415.0,2019-08-03 11:28:16+08:00,1887405905.0,白圭君,图片评论 http://t.cn/AiYCjLyc,,,,,...,,,,,,,,,,
356,604.0,4387335388619657.0,2019-06-26 07:35:31+08:00,5792044643.0,建议微博尽快倒闭,图片评论 http://t.cn/AipRbYB1,,,,,...,,,,,,,,,,
370,618.0,4387161064938438.5,2019-06-25 20:02:49+08:00,3991684702.0,travis-B,如果这个游戏活下来并且火了，那么以后中国的游戏就真的没救了。 http://t.cn/Aip...,,,,,...,,,,,,,,,,
375,623.0,4387138608664531.0,2019-06-25 18:33:35+08:00,6473448507.0,vocaloid丶亚北,原神官方终于没装死了http://t.cn/AipYfYc5，我只搬运扩散这个帖子，怎么看，...,,,,,...,,,,,,,,,,
440,692.0,4386418253630532.0,2019-06-23 18:51:10+08:00,2908847995.0,扶桑桑树,图片评论 http://t.cn/AipxihGc,,,,,...,,,,,,,,,,
527,802.0,4386030491091779.0,2019-06-22 17:10:19+08:00,5655792659.0,yuradesu,图片评论 http://t.cn/AipbDlly,,,,,...,,,,,,,,,,
528,803.0,4386030385786704.0,2019-06-22 17:09:54+08:00,5655792659.0,yuradesu,在?会抄就多抄点? http://t.cn/AipbDajQ,,,,,...,,,,,,,,,,


In [212]:
link_idx = list(hetero_data[hetero_data['F_Comment'].str.contains('http')].index)
link_idx

[110,
 149,
 202,
 205,
 356,
 370,
 375,
 440,
 527,
 528,
 690,
 893,
 931,
 1358,
 1366,
 1399,
 1673,
 1940,
 1953,
 1979,
 2091,
 2364,
 2557,
 2655,
 2670,
 2694,
 2779,
 2794,
 3078,
 3246,
 3266,
 3311,
 3429,
 3438,
 3671,
 3842]

In [213]:
def remove_link(text):
    link_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    clean_text = re.sub(link_pattern,'',text).replace(' ','')
    return clean_text

hetero_data.loc[link_idx,'F_Comment'] = hetero_data.loc[link_idx,'F_Comment'].apply(remove_link)

In [214]:
hetero_data.loc[link_idx]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
110,180.0,4408376185859132.0,2019-08-23 09:04:09+08:00,5516757915.0,四方羊尊,各位版权卫士在吗？去吧，记得告诉我结果。提醒一下，举报是要实名制的，如果有虚报，是要付法律责...,,,,,...,,,,,,,,,,
149,258.0,4402777024950108.0,2019-08-07 22:15:05+08:00,6935465869.0,迷眼鳴,各位版权卫士在吗？去吧，记得告诉我结果。提醒一下，举报是要实名制的，如果有虚报，是要负法律责...,,,,,...,,,,,,,,,,
202,403.0,4401222196411567.0,2019-08-03 15:16:44+08:00,3666589980.0,惊讶ghia,图片评论,,,,,...,,,,,,,,,,
205,406.0,4401164697167415.0,2019-08-03 11:28:16+08:00,1887405905.0,白圭君,图片评论,,,,,...,,,,,,,,,,
356,604.0,4387335388619657.0,2019-06-26 07:35:31+08:00,5792044643.0,建议微博尽快倒闭,图片评论,,,,,...,,,,,,,,,,
370,618.0,4387161064938438.5,2019-06-25 20:02:49+08:00,3991684702.0,travis-B,如果这个游戏活下来并且火了，那么以后中国的游戏就真的没救了。,,,,,...,,,,,,,,,,
375,623.0,4387138608664531.0,2019-06-25 18:33:35+08:00,6473448507.0,vocaloid丶亚北,原神官方终于没装死了，我只搬运扩散这个帖子，怎么看，怎么理解那是你们的事情，美化自己也好，觉...,,,,,...,,,,,,,,,,
440,692.0,4386418253630532.0,2019-06-23 18:51:10+08:00,2908847995.0,扶桑桑树,图片评论,,,,,...,,,,,,,,,,
527,802.0,4386030491091779.0,2019-06-22 17:10:19+08:00,5655792659.0,yuradesu,图片评论,,,,,...,,,,,,,,,,
528,803.0,4386030385786704.0,2019-06-22 17:09:54+08:00,5655792659.0,yuradesu,在?会抄就多抄点?,,,,,...,,,,,,,,,,
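
To make the behaviour of `remove_link` concrete, the sketch below reuses the same pattern on comments shown above and on one made-up English sentence. `strip_link_keep_spaces` is a hypothetical variant, not part of this notebook, that collapses whitespace instead of deleting every space; it may be preferable when comments mix in English words.

```python
import re

# Same pattern as in remove_link above
LINK_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def remove_link(text):
    # Notebook behaviour: drop the URL, then delete every space
    return re.sub(LINK_PATTERN, '', text).replace(' ', '')

def strip_link_keep_spaces(text):
    # Hypothetical variant: drop the URL, then collapse runs of whitespace
    return ' '.join(re.sub(LINK_PATTERN, '', text).split())

print(remove_link('图片评论 http://t.cn/AiYpIotC'))
# The sentence below is an invented sample, not a row of the dataset
print(remove_link('nice game http://t.cn/AiYpIotC see you'))
print(strip_link_keep_spaces('nice game http://t.cn/AiYpIotC see you'))
```

Because only the rows in `link_idx` pass through `remove_link`, the space removal affects those rows alone.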


## 3.5 Follow-up processing 3: Decode emoji and Internet slang

### 3.5.1 Transform each emoji to its Chinese code

In [215]:
hete_emoji = hetero_data.copy()
hete_emoji.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   float64                  
 1   F_CommID      3850 non-null   float64                  
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   float64                  
 4   F_UserName    3854 non-null   object                   
 5   F_Comment     3854 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      907 non-null    float64                  
 8   T_UserName    1191 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   floa

An example of a comment that contains emoji (the 🐮🍺 in the text below):

In [216]:
hete_emoji[hete_emoji['F_UserName']=='中华家的窝窝头']

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1273,1656.0,4381341875467350.5,2019-06-09 18:39:27+08:00,5130467812.0,中华家的窝窝头,米忽悠🐮🍺噢！已经预约了噢！不知道能不能上PS4呢[许愿][许愿][许愿]我买PS4就是...,,,,,...,,,[猫子啊啊啊],猫子啊啊啊,,,,,,


Section 2.7 defined the function `decode_emoji`, which converts each emoji into its corresponding Chinese code:

In [217]:
def decode_emoji(text):
    return emoji.demojize(text,language='zh')

In [218]:
# A simple usage example
decode_emoji('米忽悠🐮🍺噢！已经预约了噢！不知道能不能上PS4呢')

'米忽悠:奶牛头::啤酒:噢！已经预约了噢！不知道能不能上PS4呢'

Convert the emoji in every comment:

In [219]:
decode_text = hete_emoji['F_Comment'].apply(decode_emoji)
hete_emoji.insert(6, 'Clean_Comment' ,decode_text)

In [220]:
# The emoji in the example above are now Chinese codes: 🐮🍺 became :奶牛头::啤酒:
hete_emoji.loc[1273,'Clean_Comment']

'  米忽悠:奶牛头::啤酒:噢！已经预约了噢！不知道能不能上PS4呢[许愿][许愿][许愿]我买PS4就是为了玩原神！[太开心][太开心][太开心]'

In [221]:
hete_emoji.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype                    
---  ------         --------------  -----                    
 0   F_Idx          3850 non-null   float64                  
 1   F_CommID       3850 non-null   float64                  
 2   F_Time         3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID       3850 non-null   float64                  
 4   F_UserName     3854 non-null   object                   
 5   F_Comment      3854 non-null   object                   
 6   Clean_Comment  3854 non-null   object                   
 7   T_Idx          441 non-null    float64                  
 8   T_UserID       907 non-null    float64                  
 9   T_UserName     1191 non-null   object                   
 10  T_CommID       441 non-null    float64                  
 11  T_Time         441 non-null    datetime64[ns, UTC+08:00]
 12  T_Comment      441 n

Using an emoji as a homophone for a Chinese character is a common way on Weibo to post abusive text while evading community moderation. In the example below, "👴" means "grandpa" (爷爷, "yeye"), and it generally stands in for the character "爷" ("ye").

In [222]:
hete_emoji[hete_emoji['F_UserName']=='上天入地大头怪']

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,Clean_Comment,T_Idx,T_UserID,T_UserName,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1647,2098.0,4380998609068248.0,2019-06-08 19:55:25+08:00,2841880263.0,上天入地大头怪,给👴整笑了,给:老爷爷:整笑了,,,,...,,,[鱼酋长汝此兴奋],鱼酋长汝此兴奋,,,,,,


In [223]:
# Another example
hete_emoji[hete_emoji['F_UserName']=='碱性胶囊']

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,Clean_Comment,T_Idx,T_UserID,T_UserName,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1603,2051.0,4381008448905660.0,2019-06-08 20:34:31+08:00,5486410350.0,碱性胶囊,在？为什么不出来对线,在？为什么不出来对线,,,,...,,,[赤鸢仙人我没有说谎QAQ],赤鸢仙人我没有说谎QAQ,,,,,,
1607,2055.0,4381008159805264.0,2019-06-08 20:33:23+08:00,5486410350.0,碱性胶囊,怎么还在忙着洗地吗[疑问],怎么还在忙着洗地吗[疑问],,,,...,,,[赤鸢仙人我没有说谎QAQ],赤鸢仙人我没有说谎QAQ,,,,,,
1630,2081.0,4381002900336283.0,2019-06-08 20:12:29+08:00,5486410350.0,碱性胶囊,别吵[嘻嘻]忙着给你🐎验尸呢,别吵[嘻嘻]忙着给你:马:验尸呢,,,,...,,,[赤鸢仙人我没有说谎QAQ],赤鸢仙人我没有说谎QAQ,,,,,,
2914,2065.0,4381006016311548.0,2019-06-08 20:24:52+08:00,5486410350.0,碱性胶囊,憨憨出来对线🌶️,憨憨出来对线:红辣椒:,,,,...,,,[赤鸢仙人我没有说谎QAQ],赤鸢仙人我没有说谎QAQ,,,,,,
2916,2077.0,4381004150060959.0,2019-06-08 20:17:27+08:00,5486410350.0,碱性胶囊,你🐎可能玩崩坏脑子也给崩坏了 脑仁都烂了,你:马:可能玩崩坏脑子也给崩坏了 脑仁都烂了,,,,...,,,[赤鸢仙人我没有说谎QAQ],赤鸢仙人我没有说谎QAQ,,,,,,


For emoji of this kind, we need to convert the Chinese emoji code once more, into the Chinese character it stands for, so that the sentence stays semantically coherent. Emoji that only express the user's emotion and do not form part of the sentence's meaning keep their Chinese emoji code.

In [224]:
code_pattern = r'(?<=:)[^:]+(?=:)'
code_match = hete_emoji['Clean_Comment'].str.findall(code_pattern)
code_match

0              []
1              []
2              []
3              []
4              []
          ...    
3849           []
3850    [奶牛头, 啤酒]
3851           []
3852           []
3853           []
Name: Clean_Comment, Length: 3854, dtype: object

In [225]:
hete_emoji['emoji_code'] = code_match
hete_emoji.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype                    
---  ------         --------------  -----                    
 0   F_Idx          3850 non-null   float64                  
 1   F_CommID       3850 non-null   float64                  
 2   F_Time         3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID       3850 non-null   float64                  
 4   F_UserName     3854 non-null   object                   
 5   F_Comment      3854 non-null   object                   
 6   Clean_Comment  3854 non-null   object                   
 7   T_Idx          441 non-null    float64                  
 8   T_UserID       907 non-null    float64                  
 9   T_UserName     1191 non-null   object                   
 10  T_CommID       441 non-null    float64                  
 11  T_Time         441 non-null    datetime64[ns, UTC+08:00]
 12  T_Comment      441 n

Only 155 comments contain emoji. Because (1) there is no strict mapping between an emoji and the Chinese character it stands for, and (2) the right reading depends on context, we convert these emoji to Chinese characters manually.

In [226]:
manua_code = hete_emoji[hete_emoji['emoji_code'].apply(lambda x: len(x)>0)]
manua_code

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,Clean_Comment,T_Idx,T_UserID,T_UserName,...,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7,emoji_code
22,31.0,4723616816300419.0,2022-01-09 06:37:52+08:00,5855892820.0,品咕咕,那些热评的没🐎仔怎么还在啊[doge],那些热评的没:马:仔怎么还在啊[doge],,,,...,,,,,,,,,,[马]
31,46.0,4462043262297752.0,2020-01-18 11:17:56+08:00,7358799581.0,蠢萌萌66,可恶的米忽悠，出这么个游戏，估计又要换一部📱,可恶的米忽悠，出这么个游戏，估计又要换一部:手机:,,,,...,,,,,,,,,,[手机]
49,76.0,4453030231737848.0,2019-12-24 14:23:22+08:00,6450919178.0,-iLan,为什么不送switch➕塞尔达？,为什么不送switch:加:塞尔达？,,,,...,,,,,,,,,,[加]
94,155.0,4415365171786916.0,2019-09-11 15:55:52+08:00,1937421733.0,真呼名子,🤩,:好崇拜哦:,,,,...,,,,,,,,,,[好崇拜哦]
180,347.0,4401545162546124.0,2019-08-04 12:40:06+08:00,5685068141.0,TowerNeya,gkd，我管你抄没抄，给👴整快点，👴要玩,gkd，我管你抄没抄，给:老爷爷:整快点，:老爷爷:要玩,,,,...,,,,,,,,,,"[老爷爷, 整快点，, 老爷爷]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3766,3692.0,4408484319674466.0,2019-08-23 16:13:49+08:00,7261173554.0,wirzard1,QQ飞车是你🐴原创，跑跑卡丁车不要面子喽？,QQ飞车是你:马头:原创，跑跑卡丁车不要面子喽？,,3144123235.0,醇悟,...,1957259477.0,,,,,,,,,[马头]
3769,3697.0,4403371735419651.0,2019-08-09 13:38:15+08:00,6742834214.0,HZQJSL,沾了你🐴,沾了你:马头:,,,索狗任豚的爹,...,1957259477.0,,,,,,,,,[马头]
3777,3709.0,4387155805471942.0,2019-06-25 19:41:56+08:00,5840172755.0,sex反转异装癖lesbian,我想起了之前鬼泣5抄袭崩坏3的事😂😂😂,我想起了之前鬼泣5抄袭崩坏3的事:笑哭了::笑哭了::笑哭了:,,6610932074.0,樱岛麻衣厨丶,...,1957259477.0,,,,,,,,,"[笑哭了, 笑哭了, 笑哭了]"
3787,3721.0,4386021183907910.0,2019-06-22 16:33:20+08:00,3391529434.0,Respect_SW,别洗了 辣🐔,别洗了 辣:鸡:,,6469427522.0,灰色情感的我,...,1957259477.0,,,,,,,,,[鸡]
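
Note the row for TowerNeya above: `emoji_code` also contains `整快点，`, the ordinary text sitting between two emoji codes. The lookaround pattern does not consume the colons, so the closing colon of one code can act as the opening colon of the next match. A minimal sketch of one possible alternative, assuming codes never contain a colon or whitespace, is to consume both delimiters. `extract_codes` is a hypothetical helper and is not part of this notebook.

```python
import re

LOOKAROUND_PATTERN = r'(?<=:)[^:]+(?=:)'   # pattern used in the notebook
CONSUMING_PATTERN = r':([^:\s]+):'         # assumed alternative: colons are consumed

def extract_codes(text):
    # Each match uses up its own pair of colons, so text between codes is skipped
    return re.findall(CONSUMING_PATTERN, text)

sample = 'gkd，我管你抄没抄，给:老爷爷:整快点，:老爷爷:要玩'
print(re.findall(LOOKAROUND_PATTERN, sample))
print(extract_codes(sample))
print(extract_codes('米忽悠:奶牛头::啤酒:噢！'))
```

Either pattern selects the same rows for manual coding whenever a comment has at least one code; the difference only shows in the extracted list.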


In [227]:
manua_code = manua_code.reset_index()
manua_code = manua_code[['index','F_UserName','F_Comment','Clean_Comment','emoji_code']]
manua_code.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   index          155 non-null    int64 
 1   F_UserName     155 non-null    object
 2   F_Comment      155 non-null    object
 3   Clean_Comment  155 non-null    object
 4   emoji_code     155 non-null    object
dtypes: int64(1), object(4)
memory usage: 6.2+ KB


In [228]:
manu = 'Datasets/ManuCode.xlsx'
manua_code.to_excel(manu, index=False)
# manua_code.to_excel(manu, index=False,encoding='utf-8-sig')

Read the Excel file after the manual coding is finished:

In [229]:
manua_code = pd.read_excel('Datasets/ManuCode_Finish.xlsx')
manua_code

Unnamed: 0,index,F_UserName,F_Comment,Code_Comment,Clean_Comment,emoji_code
0,22,品咕咕,那些热评的没🐎仔怎么还在啊[doge],那些热评的没妈仔怎么还在啊[doge],那些热评的没:马:仔怎么还在啊[doge],['马']
1,31,蠢萌萌66,可恶的米忽悠，出这么个游戏，估计又要换一部📱,可恶的米忽悠，出这么个游戏，估计又要换一部手机,可恶的米忽悠，出这么个游戏，估计又要换一部:手机:,['手机']
2,49,-iLan,为什么不送switch➕塞尔达？,为什么不送switch和塞尔达？,为什么不送switch:加:塞尔达？,['加']
3,94,真呼名子,🤩,[好崇拜哦],:好崇拜哦:,['好崇拜哦']
4,180,TowerNeya,gkd，我管你抄没抄，给👴整快点，👴要玩,搞快点，我管你抄没抄，给爷整快点，爷要玩,gkd，我管你抄没抄，给:老爷爷:整快点，:老爷爷:要玩,"['老爷爷', '整快点，', '老爷爷']"
...,...,...,...,...,...,...
150,3766,wirzard1,QQ飞车是你🐴原创，跑跑卡丁车不要面子喽？,QQ飞车是你妈原创，跑跑卡丁车不要面子喽？,QQ飞车是你:马头:原创，跑跑卡丁车不要面子喽？,['马头']
151,3769,HZQJSL,沾了你🐴,沾了你妈,沾了你:马头:,['马头']
152,3777,sex反转异装癖lesbian,我想起了之前鬼泣5抄袭崩坏3的事😂😂😂,我想起了之前鬼泣5抄袭崩坏3的事[笑哭了][笑哭了][笑哭了],我想起了之前鬼泣5抄袭崩坏3的事:笑哭了::笑哭了::笑哭了:,"['笑哭了', '笑哭了', '笑哭了']"
153,3787,Respect_SW,别洗了 辣🐔,别洗了 垃圾,别洗了 辣:鸡:,['鸡']
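
The manually coded rows above repeat a few homophones. As a hedged sketch only, such recurring cases could be pre-filled as candidates before the manual pass. The three mapping entries are taken from the rows shown in this section; the names `HOMOPHONES` and `prefill_homophones` are hypothetical. Context still decides the final reading (for example, "辣🐔" was coded as "垃圾" rather than character by character), so the manual review remains necessary.

```python
# Candidate readings observed in the manually coded rows of this section
HOMOPHONES = {
    '🐎': '妈',
    '🐴': '妈',
    '👴': '爷',
}

def prefill_homophones(text, mapping=HOMOPHONES):
    # Replace only the emoji listed in the mapping; everything else is left for review
    for symbol, character in mapping.items():
        text = text.replace(symbol, character)
    return text

print(prefill_homophones('给👴整笑了'))
print(prefill_homophones('沾了你🐴'))
print(prefill_homophones('别洗了 辣🐔'))  # unchanged: not in the mapping
```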


In [230]:
manua_idx = manua_code['index']
manua_text = manua_code['Code_Comment']
manua_text

0                                    那些热评的没妈仔怎么还在啊[doge]
1                                可恶的米忽悠，出这么个游戏，估计又要换一部手机
2                                       为什么不送switch和塞尔达？
3                                                 [好崇拜哦]
4                                   搞快点，我管你抄没抄，给爷整快点，爷要玩
                             ...                        
150                                QQ飞车是你妈原创，跑跑卡丁车不要面子喽？
151                                                 沾了你妈
152                      我想起了之前鬼泣5抄袭崩坏3的事[笑哭了][笑哭了][笑哭了]
153                                               别洗了 垃圾
154    牛批，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...
Name: Code_Comment, Length: 155, dtype: object

In [231]:
hetero_data.loc[list(manua_idx),'F_Comment'] = list(manua_text)
hetero_data.loc[list(manua_idx),'F_Comment']

22                                    那些热评的没妈仔怎么还在啊[doge]
31                                可恶的米忽悠，出这么个游戏，估计又要换一部手机
49                                       为什么不送switch和塞尔达？
94                                                 [好崇拜哦]
180                                  搞快点，我管你抄没抄，给爷整快点，爷要玩
                              ...                        
3766                                QQ飞车是你妈原创，跑跑卡丁车不要面子喽？
3769                                                 沾了你妈
3777                      我想起了之前鬼泣5抄袭崩坏3的事[笑哭了][笑哭了][笑哭了]
3787                                               别洗了 垃圾
3850    牛批，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...
Name: F_Comment, Length: 155, dtype: object

In [232]:
hetero_data.loc[list(manua_idx)]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
22,31.0,4723616816300419.0,2022-01-09 06:37:52+08:00,5855892820.0,品咕咕,那些热评的没妈仔怎么还在啊[doge],,,,,...,,,,,,,,,,
31,46.0,4462043262297752.0,2020-01-18 11:17:56+08:00,7358799581.0,蠢萌萌66,可恶的米忽悠，出这么个游戏，估计又要换一部手机,,,,,...,,,,,,,,,,
49,76.0,4453030231737848.0,2019-12-24 14:23:22+08:00,6450919178.0,-iLan,为什么不送switch和塞尔达？,,,,,...,,,,,,,,,,
94,155.0,4415365171786916.0,2019-09-11 15:55:52+08:00,1937421733.0,真呼名子,[好崇拜哦],,,,,...,,,,,,,,,,
180,347.0,4401545162546124.0,2019-08-04 12:40:06+08:00,5685068141.0,TowerNeya,搞快点，我管你抄没抄，给爷整快点，爷要玩,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3766,3692.0,4408484319674466.0,2019-08-23 16:13:49+08:00,7261173554.0,wirzard1,QQ飞车是你妈原创，跑跑卡丁车不要面子喽？,,3144123235.0,醇悟,,...,4380883840572546.0,1957259477.0,,,,,,,,
3769,3697.0,4403371735419651.0,2019-08-09 13:38:15+08:00,6742834214.0,HZQJSL,沾了你妈,,,索狗任豚的爹,,...,4380883840572546.0,1957259477.0,,,,,,,,
3777,3709.0,4387155805471942.0,2019-06-25 19:41:56+08:00,5840172755.0,sex反转异装癖lesbian,我想起了之前鬼泣5抄袭崩坏3的事[笑哭了][笑哭了][笑哭了],,6610932074.0,樱岛麻衣厨丶,,...,4380883840572546.0,1957259477.0,,,,,,,,
3787,3721.0,4386021183907910.0,2019-06-22 16:33:20+08:00,3391529434.0,Respect_SW,别洗了 垃圾,,6469427522.0,灰色情感的我,,...,4380883840572546.0,1957259477.0,,,,,,,,


### 3.5.2 Decode Internet slang
On Sina Weibo (and some other Chinese social media platforms), the emoji homophones discussed in Section 3.5.1 are not the only way to post abusive content: users also abbreviate a Chinese phrase to the initial letters of its pinyin (拼音, the romanized spelling). For example, "nmsl" stands for "你妈死了" (ni ma si le, "your mother is dead").
<br>These abbreviations can only be picked out manually and replaced one by one.

nmsl

In [233]:
hetero_data[hetero_data['F_Comment'].str.contains('nmsl')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
421,673.0,4386653169053836.0,2019-06-24 10:24:37+08:00,5352600336.0,卢本开_55挂,nmsl,,,,,...,,,,,,,,,,


In [234]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('nmsl','你妈死了')

In [235]:
hetero_data.loc[421,'F_Comment']

'你妈死了'

gkd

In [236]:
hetero_data[hetero_data['F_Comment'].str.contains('gkd')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
251,487.0,4397122247520064.0,2019-07-23 07:45:01+08:00,5277946260.0,库-丘林_Alter,所以说有没有去烧米忽悠的 gkd,,,,,...,,,,,,,,,,
2605,307.0,4401935853869781.0,2019-08-05 14:32:34+08:00,7087027730.0,云夜悠长,微博人手switch人均塞尔达玩家？版权卫士gkd[嘻嘻],,,,,...,,,,,,,,,,


In [237]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('gkd','搞快点')

wdnmd

In [238]:
hetero_data[hetero_data['F_Comment'].str.contains('wdnmd')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
491,765.0,4386146941379988.0,2019-06-23 00:53:03+08:00,6486692541.0,嘬一口南瓜粥,原神看起来不错=米卫兵=脑残=啥也没玩过的工地搬砖的wdnmd[拜拜],,,,,...,,,,,,,,,,
645,942.0,4385767134498941.0,2019-06-21 23:43:51+08:00,5985268786.0,北贝95,加油奥，还有很多好游戏呢，全抄进去也许真就煮成一锅粥变成新游戏了呢[微笑]wdnmd,,,,,...,,,,,,,,,,
1817,2292.0,4380967571372984.0,2019-06-08 17:52:06+08:00,5646067604.0,Aki是乌龟,wdnmd，塞尔达游戏界天花板，你们是个什么东西？[疑问],,,,,...,,,,,,,,,,
1856,2347.0,4380959618922423.0,2019-06-08 17:20:30+08:00,3493561323.0,Jasonwrj,wdnmd 真就照着抄啊？脸都不要了？,,,,,...,,,,,,,,,,
2708,519.0,4393052338205884.0,2019-07-12 02:12:39+08:00,5679620603.0,哈维嘎嘎,抄的真棒[大拇指]，wdnmd真就四神兽也直接搬呗,,,,,...,,,,,,,,,,
3014,2575.0,4386110769400408.0,2019-06-22 22:29:20+08:00,5793575759.0,驯龙高手尹志平198102,wdnmd 人家厂商以为的的确没啥错[二哈] 你看看多少铁憨憨～,2531.0,5671232161.0,千帆幻尽,4380935199496995.0,...,4380935199496995.0,5671232161.0,,,,,,,,
3129,2687.0,4381917358889913.0,2019-06-11 08:46:13+08:00,2115615367.0,桂陵君,我昨天下午刚借我同学塞尔达重开了个号玩，然后玩了一晚上[允悲]wdnmd真有点意思啊！你说的...,,,aidow,,...,4380931105889284.0,2115615367.0,,,,,,,,


In [239]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('wdnmd','我叼你妈的')

biss

In [240]:
hetero_data[hetero_data['F_Comment'].str.contains('biss')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
189,375.0,4401364442109699.0,2019-08-04 00:41:59+08:00,7032393099.0,猫叠单数,这游戏一出biss,,,,,...,,,,,,,,,,
487,761.0,4386153824441600.0,2019-06-23 01:20:25+08:00,2076463415.0,SA口吐芬芳,还 原神 作？米忽悠这种biss钱都恰？,,,,,...,,,,,,,,,,
1106,1480.0,4381970655821345.5,2019-06-11 12:17:59+08:00,5474874791.0,cbsy85417,原神必死biss,,,,,...,,,,,,,,,,
1535,1950.0,4381035778815009.0,2019-06-08 22:23:07+08:00,6269658776.0,爵士天神老布,再抄nm今晚biss,,,,,...,,,,,,,,,,
1566,1983.0,4381020826125716.0,2019-06-08 21:23:43+08:00,5816931151.0,月影星河Qaq,抄袭biss嗷,,,,,...,,,,,,,,,,
1810,2285.0,4380969455146942.5,2019-06-08 17:59:34+08:00,5504387639.0,鸥叽叽叽,这个游戏biss嗷[嘻嘻]谁想玩速度拉黑我，不然我骂死你[太开心] 谁和我一起骂我们就是异...,,,,,...,,,[原神],原神,,,,,,
1854,2345.0,4380959930005433.5,2019-06-08 17:21:44+08:00,5538289556.0,明日方舟鸿雪,抄袭biss哦[太开心],,,,,...,,,,,,,,,,
2022,2639.0,4380932137587177.5,2019-06-08 15:31:18+08:00,5590792808.0,终极自闭人,米哈游biss[嘻嘻],,,,,...,,,,,,,,,,


In [241]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('biss','必死')

nb

In [242]:
hetero_data[hetero_data['F_Comment'].str.contains('nb')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1165,1540.0,4381701519802806.5,2019-06-10 18:28:32+08:00,6027499723.0,神烦g916,nb，没了,,,,,...,,,"[米哈游miHoYo, 索尼中国]",米哈游miHoYo,索尼中国,,,,,
1176,1552.0,4381674500700759.0,2019-06-10 16:41:10+08:00,2744046920.0,MO_o233,期待原神很久了 mhynb,,,,,...,,,[网易电竞NeXT],网易电竞NeXT,,,,,,
2038,2726.0,4380929998503397.0,2019-06-08 15:22:48+08:00,7185098377.0,哇咔咔11647,mihoyonb 希望得到一个。,,,,,...,,,[原神],原神,,,,,,
2820,1435.0,4404068694922232.0,2019-08-11 11:47:43+08:00,2244064047.0,廷廷且重开66,sgnb！,,6598630748.0,万肝之王EDTA,,...,4382004860350523.0,5356469720.0,,,,,,,,
2864,1815.0,4382528503130795.0,2019-06-13 01:14:41+08:00,6294377796.0,大河境36187,nb呀,1802.0,5457782776.0,污半生,4381170940729462.5,...,4381170940729462.5,5457782776.0,,,,,,,,
3286,3058.0,4382520357265613.0,2019-06-13 00:42:19+08:00,2527233773.0,吃了萤火虫会发光吗,说错了[嘻嘻]是米卫兵说的[嘻嘻]崩三能抄袭日本那个游戏是那个游戏的荣幸[嘻嘻]塞尔达也是一...,,2488554890.0,LFRitual,,...,4380899494009653.0,6578279612.0,,,,,,,,
3640,3572.0,4381006465317676.0,2019-06-08 20:26:38+08:00,3459824410.0,龙猫放弃涮羊肉,塞尔达最nb！（破音），确实也像龙之谷，我觉得是把这俩游戏合起来了，要把塞尔达完全照搬是不可...,,6116368506.0,Alyssa的糖果,,...,4380885732287240.0,6500909470.0,,,,,,,,


In [243]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('nb','牛批')

In [244]:
hetero_data.loc[2820,'F_Comment']

'sg牛批！'

In [245]:
hetero_data.loc[2820,'F_Comment'] = '爽哥牛批！'

awsl and wsl

In [246]:
hetero_data[hetero_data['F_Comment'].str.contains('awsl')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
985,1325.0,4382628998658687.0,2019-06-13 07:54:01+08:00,6912787337.0,tangjiahua26344,，awsl,,,,,...,,,[崩坏三],崩坏三,,,,,,
1648,2099.0,4380998315283579.0,2019-06-08 19:54:16+08:00,5978979527.0,破鱼PoYuu,awsl[二哈][二哈],,,,,...,,,,,,,,,,
2434,3794.0,4380881864622089.5,2019-06-08 12:11:32+08:00,5762721940.0,白木圭可大人,awsl,,,,,...,,,,,,,,,,


In [247]:
hetero_data[hetero_data['F_Comment'].str.contains('wsl')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
985,1325.0,4382628998658687.0,2019-06-13 07:54:01+08:00,6912787337.0,tangjiahua26344,，awsl,,,,,...,,,[崩坏三],崩坏三,,,,,,
1648,2099.0,4380998315283579.0,2019-06-08 19:54:16+08:00,5978979527.0,破鱼PoYuu,awsl[二哈][二哈],,,,,...,,,,,,,,,,
2418,3779.0,4380883211537085.0,2019-06-08 12:16:52+08:00,1879393202.0,寒天黑糖,好可爱的小男孩！！！！wsl,,,半透明黑桶,,...,,,"[半透明黑桶, 寒天黑糖]",半透明黑桶,,,,,,
2434,3794.0,4380881864622089.5,2019-06-08 12:11:32+08:00,5762721940.0,白木圭可大人,awsl,,,,,...,,,,,,,,,,


In [248]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('awsl','啊我死了')
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('wsl','我死了')

sb

In [249]:
hetero_data[hetero_data['F_Comment'].str.contains('sb')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
1803,2277.0,4380972089324269.0,2019-06-08 18:10:02+08:00,2171210083.0,月似旧时梦,sb,,,,,...,,,,,,,,,,
2761,944.0,4385765595078360.0,2019-06-21 23:37:44+08:00,5782527935.0,亚希露,我头一次看到连音效都抄的sb玩意儿 看了直播才发现还抄了尼尔[吐],,,,,...,,,,,,,,,,


In [250]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('sb','傻逼')

cnm

In [251]:
hetero_data[hetero_data['F_Comment'].str.contains('cnm')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7


nmb

In [252]:
hetero_data[hetero_data['F_Comment'].str.contains('nmb')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7


Normalize every alias of the company 米哈游 (here "mhy" and "miHoYo") to "米哈游":

In [253]:
hetero_data[hetero_data['F_Comment'].str.contains('mhy')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
200,398.0,4401293873628857.5,2019-08-03 20:01:34+08:00,6089356377.0,袁yy06119,mhy没有妈,,,,,...,,,,,,,,,,
462,714.0,4386343918291999.0,2019-06-23 13:55:47+08:00,6296187775.0,璃喵82740,辣鸡厂商 本来对mhy印象还不错，现在已经不会再去玩你们的游戏了,,,,,...,,,,,,,,,,
1176,1552.0,4381674500700759.0,2019-06-10 16:41:10+08:00,2744046920.0,MO_o233,期待原神很久了 mhy牛批,,,,,...,,,[网易电竞NeXT],网易电竞NeXT,,,,,,
2006,2623.0,4380933643312451.0,2019-06-08 15:37:16+08:00,2717396284.0,GungnirOD,你要真能抄出个塞尔达荒野之息我都算你mhy牛逼。就怕抄几个机制完事儿，空有一张皮。自由的开放...,,,,,...,,,,,,,,,,
2565,248.0,4407172718927631.0,2019-08-20 01:22:00+08:00,7087027730.0,云夜悠长,亲亲，这边建议您直接举报原神哦，光和mhy玩家对线是没有结果滴,245.0,2140667320.0,红空岛寺,4403046332881910.5,...,4403046332881910.5,2140667320.0,,,,,,,,
2618,335.0,4407177844023700.0,2019-08-20 01:42:22+08:00,7087027730.0,云夜悠长,嗯没错，由mhy开启的中国原创游戏废土时代就此降临！,325.0,3229684874.0,吃草莓咩咩咩,4401698279516703.0,...,4401698279516703.0,3229684874.0,,,,,,,,
2641,365.0,4401394490473371.0,2019-08-04 02:41:23+08:00,2120345392.0,林启裕Jim,非常期待能在手机上能玩到这个游戏！人物模型，画面感，画风，特效渲染，在国内手机里都是巅峰级别...,,,,,...,,,,,,,,,,
2642,366.0,4407178872065791.0,2019-08-20 01:46:27+08:00,7087027730.0,云夜悠长,横版-3D战斗-开放世界/galgame（或许？没太关注未定事件簿） 鬼知道mhy下一步会干...,365.0,2120345392.0,林启裕Jim,4401394490473371.0,...,4401394490473371.0,2120345392.0,,,,,,,,
2718,722.0,4400807274973428.0,2019-08-02 11:48:00+08:00,5605946196.0,烟花飞渡似水流年61558,很明显，mhy动到大佬的蛋糕了。。。还有晚上人的脑子真的是一个人长大的,718.0,6466540211.0,厨余残渣,4386329834008973.0,...,4386329834008973.0,6466540211.0,,,,,,,,
2813,1449.0,4388669932951553.0,2019-06-29 23:58:32+08:00,6113608398.0,别开腔_自己人,索尼拿mhy当枪使,1434.0,5356469720.0,error1980,4382004860350523.0,...,4382004860350523.0,5356469720.0,,,,,,,,


In [254]:
hetero_data[hetero_data['F_Comment'].str.contains('miHoYo')]

Unnamed: 0,F_Idx,F_CommID,F_Time,F_UserID,F_UserName,F_Comment,T_Idx,T_UserID,T_UserName,T_CommID,...,Floor_CommID,Floor_UserID,At_User,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
91,152.0,4416882087892965.0,2019-09-15 20:23:33+08:00,5772842621.0,没头脑吔不高兴,miHoYo！,,,,,...,,,[青青子w],青青子w,,,,,,
923,1252.0,4383068532442949.0,2019-06-14 13:00:34+08:00,7193314317.0,Unforgiveable70756,miHoYo全新力作啊 支持支持[嘻嘻] 从崩崩崩2.6就开始支持miHoYo了呢 祝...,,,,,...,,,[KianaKaslana58529],KianaKaslana58529,,,,,,
924,1253.0,4383067756515710.5,2019-06-14 12:57:29+08:00,7193323484.0,KianaKaslana58529,miHoYo全新力作啊 支持支持[鼓掌],,,,,...,,,[Unforgiveable70756],Unforgiveable70756,,,,,,


In [255]:
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('mhy','米哈游')
hetero_data['F_Comment'] = hetero_data['F_Comment'].str.replace('miHoYo','米哈游')
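
The replacements in the cells above can also be collected into one ordered table, which keeps the order dependency visible ("awsl" has to be handled before "wsl"). The sketch below is a plain-Python illustration under that assumption; `SLANG_TABLE` and `normalize_slang` are hypothetical names, and the `ignore_case` switch is an addition that the notebook does not use. It is included because row 2038 spells the company name as lowercase "mihoyo", which a case-sensitive replacement of "miHoYo" leaves untouched.

```python
import re

# Ordered (abbreviation, expansion) pairs taken from the cells above.
# 'awsl' must come before 'wsl'; otherwise 'awsl' would become 'a我死了'.
SLANG_TABLE = [
    ('nmsl', '你妈死了'),
    ('gkd', '搞快点'),
    ('wdnmd', '我叼你妈的'),
    ('biss', '必死'),
    ('nb', '牛批'),
    ('awsl', '啊我死了'),
    ('wsl', '我死了'),
    ('sb', '傻逼'),
    ('mhy', '米哈游'),
    ('miHoYo', '米哈游'),
]

def normalize_slang(text, table=SLANG_TABLE, ignore_case=False):
    # ignore_case is an assumption of this sketch, not notebook behaviour
    flags = re.IGNORECASE if ignore_case else 0
    for slang, expansion in table:
        # re.escape keeps each abbreviation a literal string
        text = re.sub(re.escape(slang), expansion, text, flags=flags)
    return text

print(normalize_slang('awsl'))
print(normalize_slang('期待原神很久了 mhynb'))
print(normalize_slang('mihoyonb', ignore_case=True))
```

A function like this could be applied to the column with `hetero_data['F_Comment'].apply(normalize_slang)`. Plain substring replacement also rewrites abbreviations that happen to occur inside longer Latin-letter words, so the matched rows are still worth inspecting first, as the cells above do.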

# 4. Dataset reorganization 2: Convert the datasets to the DataFrame form required for HeteroG construction

The previous sections completed all of the data cleaning. In this section, we reorganize the resulting datasets into the form that HeteroG requires.

## 4.1 User-user interaction network
- This section includes
  - Extract all columns related to user interaction (F_UserName, F_UserID, T_UserName, T_UserID, Floor_UserID, At_U1~At_U7)
  - Extend `profile_data`: every user without a profile is appended after the last row of `profile_data`
  - Re-index the users in the completed `profile_data`; the new index is the user node index (`user_nIdx`), and the table serves as the user feature dataset
  - Reorganize the dataset so that each row has exactly one From and one To
    - If the user is an isolated point who only posted a comment (i.e., User type 1 mentioned in Section 4.1.1), the row has only From and To is null; otherwise:
    - If To is not null, keep To as it is
    - If To is null, fill To with the Floor data
    - At is independent of To and Floor and needs no additional processing
  - Append `user_nIdx` to the user interaction DataFrame obtained above and to the user feature DataFrame
  - Convert the user interaction DataFrame and the user feature DataFrame into tensors
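
The From/To rule in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: rows are dictionaries (for example from `user_user.to_dict('records')`), `Floor_UserName` is the column attached in Section 4.1.3, and the function names and the `'reply'`/`'at'` labels are hypothetical.

```python
import math

AT_COLS = tuple(f'At_U{i}' for i in range(1, 8))  # At_U1 ... At_U7

def is_missing(value):
    # pandas stores missing cells as NaN, which is a truthy float
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_edges(rows):
    """Return (from_user, to_user, kind) tuples; the kind labels are hypothetical."""
    edges = []
    for row in rows:
        src = row['F_UserName']
        target = row.get('T_UserName')
        if is_missing(target):
            # No explicit To: fall back to the owner of the floor
            target = row.get('Floor_UserName')
        # An isolated commenter keeps a row whose To is None
        edges.append((src, None if is_missing(target) else target, 'reply'))
        # Mentions are independent of To/Floor: one edge per @-user
        for col in AT_COLS:
            mentioned = row.get(col)
            if not is_missing(mentioned):
                edges.append((src, mentioned, 'at'))
    return edges

rows = [
    {'F_UserName': 'A'},                                            # isolated point
    {'F_UserName': 'B', 'T_UserName': 'C', 'Floor_UserName': 'D'},  # To is present
    {'F_UserName': 'E', 'T_UserName': float('nan'),
     'Floor_UserName': 'D', 'At_U1': 'F'},                          # Floor fallback plus one @
]
print(build_edges(rows))
```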

### 4.1.1 Extract the user interaction information and save it as a new DataFrame

You can skip to Section 4.1.2 and read the saved dataset directly.

In [256]:
hetero_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   float64                  
 1   F_CommID      3850 non-null   float64                  
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   float64                  
 4   F_UserName    3854 non-null   object                   
 5   F_Comment     3854 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      907 non-null    float64                  
 8   T_UserName    1191 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   floa

In [257]:
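# Select the interaction columns by name rather than by position
inter_cols = ['F_UserID','F_UserName','T_UserID','T_UserName','Floor_UserID'] + [f'At_U{i}' for i in range(1,8)]
user_user = hetero_data[inter_cols].copy()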
user_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   F_UserID      3850 non-null   float64
 1   F_UserName    3854 non-null   object 
 2   T_UserID      907 non-null    float64
 3   T_UserName    1191 non-null   object 
 4   Floor_UserID  1186 non-null   float64
 5   At_U1         1008 non-null   object 
 6   At_U2         56 non-null     object 
 7   At_U3         10 non-null     object 
 8   At_U4         2 non-null      object 
 9   At_U5         1 non-null      object 
 10  At_U6         1 non-null      object 
 11  At_U7         1 non-null      object 
dtypes: float64(3), object(9)
memory usage: 361.4+ KB


In [258]:
userInter='Datasets/UserInteraction.csv'
user_user.to_csv(userInter, index=False,encoding='utf-8-sig')

### 4.1.2 Read `user_user`: the raw interaction DataFrame

In [259]:
user_user=pd.read_csv('Datasets/UserInteraction.csv')
user_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   F_UserID      3850 non-null   float64
 1   F_UserName    3854 non-null   object 
 2   T_UserID      907 non-null    float64
 3   T_UserName    1191 non-null   object 
 4   Floor_UserID  1186 non-null   float64
 5   At_U1         1008 non-null   object 
 6   At_U2         56 non-null     object 
 7   At_U3         10 non-null     object 
 8   At_U4         2 non-null      object 
 9   At_U5         1 non-null      object 
 10  At_U6         1 non-null      object 
 11  At_U7         1 non-null      object 
dtypes: float64(3), object(9)
memory usage: 361.4+ KB


### 4.1.3 Attach the Floor UserName to `user_user`

Weibo does not allow duplicate usernames, so a username identifies a user on its own. The `F_UserID` and `T_UserID` columns are therefore redundant and can be dropped.

In [260]:
user_user.drop(columns=['F_UserID','T_UserID'],inplace=True)
user_user

Unnamed: 0,F_UserName,T_UserName,Floor_UserID,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
0,祎只狸猫,,,,,,,,,
1,邮一棵草莓i,,,,,,,,,
2,kmimg7,,,,,,,,,
3,小祀弟弟吖,,,,,,,,,
4,不知道如何评价,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
3849,Aki是乌龟,怕事先改名shaw,6201564748.0,,,,,,,
3850,兔纸今天能摸到鱼吗,沉迷艾欧泽亚的菜菜啊,6201564748.0,,,,,,,
3851,兔纸今天能摸到鱼吗,沙奈朵的裙底到底有什么,6201564748.0,,,,,,,
3852,淐馮,元首的胖次00658,6940897092.0,,,,,,,


`name_id` was defined in Section 3.2:

In [261]:
name_id

Unnamed: 0,UserID,UserName
0,1001914040,薪火鹏
1,1008309912,提尔乌斯
2,1025900974,猫的摇篮-伪物
3,1028179843,非常神奇的老z
4,1035744261,假装很强的萌新
...,...,...
2798,7755717663,寂月海200007
2799,7766444420,烛虚cron
2800,7772408887,bo_白色大月亮
2801,7774567481,你好陈博


In [262]:
# 融合/Merge
user_user = pd.merge(user_user, name_id, how='left', left_on='Floor_UserID',right_on='UserID')
user_user

Unnamed: 0,F_UserName,T_UserName,Floor_UserID,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7,UserID,UserName
0,祎只狸猫,,,,,,,,,,,
1,邮一棵草莓i,,,,,,,,,,,
2,kmimg7,,,,,,,,,,,
3,小祀弟弟吖,,,,,,,,,,,
4,不知道如何评价,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
3849,Aki是乌龟,怕事先改名shaw,6201564748.0,,,,,,,,6201564748.0,兔纸今天能摸到鱼吗
3850,兔纸今天能摸到鱼吗,沉迷艾欧泽亚的菜菜啊,6201564748.0,,,,,,,,6201564748.0,兔纸今天能摸到鱼吗
3851,兔纸今天能摸到鱼吗,沙奈朵的裙底到底有什么,6201564748.0,,,,,,,,6201564748.0,兔纸今天能摸到鱼吗
3852,淐馮,元首的胖次00658,6940897092.0,,,,,,,,6940897092.0,元首的胖次00658
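
The left merge above only keeps `user_user` at its original length if each `Floor_UserID` matches at most one row of `name_id`. A minimal sketch on toy data of how that assumption could be made explicit with the `validate` argument of `pd.merge`; the frames `left` and `lookup` and all of their values are hypothetical stand-ins, not the notebook's data:

```python
import pandas as pd

# Toy stand-ins for `user_user` and `name_id` (hypothetical values).
left = pd.DataFrame({
    "F_UserName": ["a", "b", "c"],
    "Floor_UserID": [float("nan"), 11.0, 12.0],
})
lookup = pd.DataFrame({"UserID": [11, 12], "UserName": ["u11", "u12"]})

# validate="m:1" raises a MergeError if a UserID occurs more than once in `lookup`.
merged = pd.merge(left, lookup, how="left",
                  left_on="Floor_UserID", right_on="UserID",
                  validate="m:1")

# With a unique lookup key, a left merge returns one output row per input row.
print(len(merged) == len(left))
print(merged[["F_UserName", "UserName"]])
```

If the check fails, the lookup table contains a repeated ID and the merged frame would no longer line up row by row with the original one.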


In [263]:
floor_Uname = user_user['UserName']
user_user.drop(columns=['Floor_UserID','UserID','UserName'],inplace=True)

In [264]:
user_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   F_UserName  3854 non-null   object
 1   T_UserName  1191 non-null   object
 2   At_U1       1008 non-null   object
 3   At_U2       56 non-null     object
 4   At_U3       10 non-null     object
 5   At_U4       2 non-null      object
 6   At_U5       1 non-null      object
 7   At_U6       1 non-null      object
 8   At_U7       1 non-null      object
dtypes: object(9)
memory usage: 271.1+ KB


In [265]:
user_user.insert(2,'Floor_UserName',floor_Uname)
user_user

Unnamed: 0,F_UserName,T_UserName,Floor_UserName,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
0,祎只狸猫,,,,,,,,,
1,邮一棵草莓i,,,,,,,,,
2,kmimg7,,,,,,,,,
3,小祀弟弟吖,,,,,,,,,
4,不知道如何评价,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
3849,Aki是乌龟,怕事先改名shaw,兔纸今天能摸到鱼吗,,,,,,,
3850,兔纸今天能摸到鱼吗,沉迷艾欧泽亚的菜菜啊,兔纸今天能摸到鱼吗,,,,,,,
3851,兔纸今天能摸到鱼吗,沙奈朵的裙底到底有什么,兔纸今天能摸到鱼吗,,,,,,,
3852,淐馮,元首的胖次00658,元首的胖次00658,,,,,,,


In [266]:
user_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   F_UserName      3854 non-null   object
 1   T_UserName      1191 non-null   object
 2   Floor_UserName  1186 non-null   object
 3   At_U1           1008 non-null   object
 4   At_U2           56 non-null     object
 5   At_U3           10 non-null     object
 6   At_U4           2 non-null      object
 7   At_U5           1 non-null      object
 8   At_U6           1 non-null      object
 9   At_U7           1 non-null      object
dtypes: object(10)
memory usage: 301.2+ KB


### 4.1.4 构建最终的用户数据库（合计3783名用户），并获得最终的用户特征数据集/Construct the final user database (3783 users in total) and obtain the final user features dataset

<font color='red'><b>请注意，由于该section的第一部分（即 (1) 准备工作/Preparation）中使用了`set`来得到独特的用户名列表，用户append的操作将无法复现。尽管`set`可以提取出所有的独特用户名，但是无法保证每次运行的顺序。因此在该jupyter文件（是初始文件的整合版）中，请不要运行该章节下的(1) 和 (2) 中的代码。请直接转至(3)读取最开始的jupyter文件版本（同样也是后续训练模型的版本）中保存的文件。如果您有其他可复现的方法，欢迎告知我</b></font>

<font color='skyblue'><b>Please note: part (1) Preparation of this section uses `set` to collect the unique usernames, so the order in which users are appended cannot be reproduced. A `set` extracts every unique username, but it does not guarantee the same order on each run. Therefore, in this Jupyter file (the integrated version of the initial file), DO NOT run the code in (1) and (2). Go directly to (3) and read the file saved by the initial Jupyter file version (the same file used in the subsequent model training). If you know a reproducible alternative, please feel free to share it with me :)</b></font>

#### (1) 准备工作/Preparation

确认总用户数

Determine the total number of users (both those with profiles and those without)

In [383]:
uni_uName = set()
for c in user_user.columns:
    uni_uName |= set(user_user[c].dropna())

all_user = list(uni_uName)
len(all_user)

3783
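
The `set`-based cell above yields the right members but not a stable order. One possible order-preserving alternative is sketched below on toy data; it relies on `dict` keeping insertion order. The frame `toy` and its values are hypothetical. Note that switching to this approach would produce a different node order from the one stored in the saved files, so it is only a suggestion for a fresh run, not a drop-in replacement.

```python
import pandas as pd

# Toy stand-in for `user_user` (hypothetical values).
toy = pd.DataFrame({
    "F_UserName": ["a", "b", "a", "c"],
    "T_UserName": [None, "a", "d", None],
    "At_U1": [None, None, "b", "e"],
})

# dict.fromkeys keeps first-seen order, so the result is identical on every run.
seen = {}
for col in toy.columns:                      # columns are visited in a fixed order
    seen.update(dict.fromkeys(toy[col].dropna()))
all_user_ordered = list(seen)

# Same members as the set-based approach, but in a deterministic order.
uni = set()
for col in toy.columns:
    uni |= set(toy[col].dropna())

print(all_user_ordered)
print(set(all_user_ordered) == uni)
```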

为了简化GNNs中的one-hot层结构以及避免地域歧视，ProvinceCode将被drop掉。

To simplify the one-hot structure in the GNNs and to avoid geographical discrimination, ProvinceCode will be dropped

In [347]:
profile_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   UserID          2803 non-null   int64 
 1   IndexList       2803 non-null   object
 2   UserName        2803 non-null   object
 3   TotalComment    2803 non-null   int64 
 4   Comment         2803 non-null   int64 
 5   Reply           2803 non-null   int64 
 6   LikeCount       2803 non-null   int64 
 7   ProvinceCode    2803 non-null   int64 
 8   Description     2803 non-null   int64 
 9   DescriptionLen  2803 non-null   int64 
 10  SpecialChar     2803 non-null   int64 
 11  UserGender      2803 non-null   int64 
 12  UserFan         2803 non-null   int64 
 13  UserFollow      2803 non-null   int64 
 14  UserWeibo       2803 non-null   int64 
 15  UserVerified    2803 non-null   int64 
 16  LoyalFan        2803 non-null   int64 
 17  VipRank         2803 non-null   int64 
dtypes: int64

In [348]:
len(profile_data['ProvinceCode'].value_counts())

35

In [349]:
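# .copy() 避免后续赋值时出现SettingWithCopyWarning/.copy() avoids a SettingWithCopyWarning on later assignments
user_feature = profile_data.iloc[:,[2,8,9,10,11,12,13,14,15,16,17]].copy()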
user_feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2803 entries, 0 to 2802
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   UserName        2803 non-null   object
 1   Description     2803 non-null   int64 
 2   DescriptionLen  2803 non-null   int64 
 3   SpecialChar     2803 non-null   int64 
 4   UserGender      2803 non-null   int64 
 5   UserFan         2803 non-null   int64 
 6   UserFollow      2803 non-null   int64 
 7   UserWeibo       2803 non-null   int64 
 8   UserVerified    2803 non-null   int64 
 9   LoyalFan        2803 non-null   int64 
 10  VipRank         2803 non-null   int64 
dtypes: int64(10), object(1)
memory usage: 241.0+ KB


In [350]:
user_feature.head()

Unnamed: 0,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank
0,薪火鹏,1,17,2,1,30,56,692,0,0,0
1,提尔乌斯,1,4,0,1,0,0,0,0,0,0
2,猫的摇篮-伪物,1,11,1,1,917,1109,2013,0,0,0
3,非常神奇的老z,0,0,0,1,13,59,9,0,0,0
4,假装很强的萌新,1,10,0,1,14,34,107,0,1,0


探索了VipRank各值的分布情况后，可以发现VipRank=0占了大多数。将VipRank做成新的二分变量Vip：
- Vip=0: Vip等级为0
- Vip=1：Vip等级不为0

After exploring the distribution of VipRank, we find that VipRank=0 accounts for the vast majority of users. VipRank is therefore converted to a new binary variable Vip:
- Vip=0: VipRank=0
- Vip=1: VipRank≠0

In [352]:
user_feature['VipRank'].value_counts().sort_index()

VipRank
0    2595
1      22
2       8
3       8
4      10
5      18
6      65
7      67
8      10
Name: count, dtype: int64

In [353]:
user_feature['Vip'] = 0
vip_idx = list(user_feature[user_feature['VipRank']!=0].index)
user_feature.loc[vip_idx, 'Vip'] = 1
user_feature['Vip'].value_counts().sort_index()

Vip
0    2595
1     208
Name: count, dtype: int64
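
The binarisation above can also be written as a single vectorised comparison. A minimal sketch on a toy column (the values are hypothetical):

```python
import pandas as pd

# Toy VipRank column (hypothetical values).
toy = pd.DataFrame({"VipRank": [0, 3, 0, 7, 1]})

# Vip = 1 when VipRank is non-zero, otherwise 0.
toy["Vip"] = (toy["VipRank"] != 0).astype(int)
print(toy)
```

The comparison produces a boolean column and `astype(int)` turns it into 0/1, so no intermediate index list is needed.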

将缺失了user profile的用户append到`user_feature`

Append the users without profiles to `user_feature`

In [288]:
miss_name = [un for un in all_user if un not in user_feature['UserName'].values]
user_feature = pd.concat([user_feature, pd.DataFrame({'UserName': miss_name})]).reset_index()
user_feature

Unnamed: 0,index,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
0,0,薪火鹏,1.0,17.0,2.0,1.0,30.0,56.0,692.0,0.0,0.0,0.0,0.0
1,1,提尔乌斯,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,猫的摇篮-伪物,1.0,11.0,1.0,1.0,917.0,1109.0,2013.0,0.0,0.0,0.0,0.0
3,3,非常神奇的老z,0.0,0.0,0.0,1.0,13.0,59.0,9.0,0.0,0.0,0.0,0.0
4,4,假装很强的萌新,1.0,10.0,0.0,1.0,14.0,34.0,107.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3778,975,宋宋Monkey,,,,,,,,,,,
3779,976,Shadow-201201,,,,,,,,,,,
3780,977,我是你晗大大,,,,,,,,,,,
3781,978,普莱米亚姆,,,,,,,,,,,


设置user node index：“user_nIdx”

Set the user node index "user_nIdx"

In [289]:
user_feature = user_feature.reset_index()
user_feature

Unnamed: 0,level_0,index,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
0,0,0,薪火鹏,1.0,17.0,2.0,1.0,30.0,56.0,692.0,0.0,0.0,0.0,0.0
1,1,1,提尔乌斯,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,2,猫的摇篮-伪物,1.0,11.0,1.0,1.0,917.0,1109.0,2013.0,0.0,0.0,0.0,0.0
3,3,3,非常神奇的老z,0.0,0.0,0.0,1.0,13.0,59.0,9.0,0.0,0.0,0.0,0.0
4,4,4,假装很强的萌新,1.0,10.0,0.0,1.0,14.0,34.0,107.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3778,3778,975,宋宋Monkey,,,,,,,,,,,
3779,3779,976,Shadow-201201,,,,,,,,,,,
3780,3780,977,我是你晗大大,,,,,,,,,,,
3781,3781,978,普莱米亚姆,,,,,,,,,,,


In [325]:
user_feature.drop(columns='index',inplace=True)

In [291]:
user_feature.rename(columns={'level_0':'user_nIdx'},inplace=True)
user_feature

Unnamed: 0,user_nIdx,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
0,0,薪火鹏,1.0,17.0,2.0,1.0,30.0,56.0,692.0,0.0,0.0,0.0,0.0
1,1,提尔乌斯,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,猫的摇篮-伪物,1.0,11.0,1.0,1.0,917.0,1109.0,2013.0,0.0,0.0,0.0,0.0
3,3,非常神奇的老z,0.0,0.0,0.0,1.0,13.0,59.0,9.0,0.0,0.0,0.0,0.0
4,4,假装很强的萌新,1.0,10.0,0.0,1.0,14.0,34.0,107.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3778,3778,宋宋Monkey,,,,,,,,,,,
3779,3779,Shadow-201201,,,,,,,,,,,
3780,3780,我是你晗大大,,,,,,,,,,,
3781,3781,普莱米亚姆,,,,,,,,,,,
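
The append, the two `reset_index` calls, the `drop`, and the `rename` above can be condensed. A minimal sketch on toy data, assuming the node index is simply the row position after concatenation; `feat`, `names`, and their values are hypothetical:

```python
import pandas as pd

# Toy stand-ins: two users with profiles, four names seen in the interactions.
feat = pd.DataFrame({"UserName": ["a", "b"], "Vip": [0, 1]})
names = ["b", "c", "a", "d"]

# Names without a profile row, in the order in which they appear in `names`.
known = set(feat["UserName"])
missing = [n for n in names if n not in known]

# ignore_index=True renumbers the rows 0..n-1, so no leftover `index` column appears.
full = pd.concat([feat, pd.DataFrame({"UserName": missing})], ignore_index=True)

# The row position becomes the user node index.
full.insert(0, "user_nIdx", list(range(len(full))))
print(full)
```

Using a `set` only for the membership test keeps the lookup fast without affecting the order of `missing`.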


可以发现在user_nIdx=2803后的用户，所有的特征全为空值。这980个用户在GNNs里将被用于embedding层以得到伪profile

For users with user_nIdx ≥ 2803, all features are NaN. These 980 users will be passed through an embedding layer in the GNNs to obtain pseudo profiles

In [327]:
user_feature[user_feature['Description'].isna()]

Unnamed: 0,user_nIdx,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
2803,2803,折花绯雪,,,,,,,,,,,
2804,2804,守护舒宝,,,,,,,,,,,
2805,2805,-Miserable-,,,,,,,,,,,
2806,2806,Arne晓暮,,,,,,,,,,,
2807,2807,Fooooooooo星,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3778,3778,看到尽头以前就一起走吧,,,,,,,,,,,
3779,3779,Acer宏碁,,,,,,,,,,,
3780,3780,nlhsmkt_399,,,,,,,,,,,
3781,3781,HellonmbdKity,,,,,,,,,,,


#### (2) 保存处理后的用户特征数据集/Save the processed user feature dataset `user_feature`

可以跳转至下一个Section直接读取数据

You can turn to the next Section to read the processed `user_feature` directly

In [328]:
user_feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3783 entries, 0 to 3782
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_nIdx       3783 non-null   int64  
 1   UserName        3783 non-null   object 
 2   Description     2803 non-null   float64
 3   DescriptionLen  2803 non-null   float64
 4   SpecialChar     2803 non-null   float64
 5   UserGender      2803 non-null   float64
 6   UserFan         2803 non-null   float64
 7   UserFollow      2803 non-null   float64
 8   UserWeibo       2803 non-null   float64
 9   UserVerified    2803 non-null   float64
 10  LoyalFan        2803 non-null   float64
 11  VipRank         2803 non-null   float64
 12  Vip             2803 non-null   float64
dtypes: float64(11), int64(1), object(1)
memory usage: 384.3+ KB


In [329]:
feature = 'Datasets/UserFeature_Graph.csv'
user_feature.to_csv(feature, index=False,encoding='utf-8-sig')

#### (3) 读取数据：user_feature 用户特征数据集/Read the data: `user_feature` user feature dataset

In [267]:
user_feature = pd.read_csv('Datasets/UserFeature_Graph.csv')
user_feature

Unnamed: 0,user_nIdx,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
0,0,薪火鹏,1.0,17.0,2.0,1.0,30.0,56.0,692.0,0.0,0.0,0.0,0.0
1,1,提尔乌斯,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,猫的摇篮-伪物,1.0,11.0,1.0,1.0,917.0,1109.0,2013.0,0.0,0.0,0.0,0.0
3,3,非常神奇的老z,0.0,0.0,0.0,1.0,13.0,59.0,9.0,0.0,0.0,0.0,0.0
4,4,假装很强的萌新,1.0,10.0,0.0,1.0,14.0,34.0,107.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3778,3778,宋宋Monkey,,,,,,,,,,,
3779,3779,Shadow-201201,,,,,,,,,,,
3780,3780,我是你晗大大,,,,,,,,,,,
3781,3781,普莱米亚姆,,,,,,,,,,,


#### (4) 将用户特征数据集转换为Tensor/Transform the user feature dataset (including necessary features) to Tensor

这一步包括：
- 提取所需的特征列
- 将user_nIdx为0~2802的全部用户的特征转换成tensor
- user_nIdx=2803后的全部用户将在GNNs的embedding layer进行处理
- user_nIdx=2803后的全部用户将被赋予一个新的索引，用以后续embedding layer

This step includes:
- Extract the features we need
- Convert the features of the users with user_nIdx $\in [0, 2802]$ to a Tensor
- Users with user_nIdx $\in [2803, 3782]$ will be handled by the embedding layer of the GNNs; each of them is given a new index for that embedding layer

提取相关列，并转换为tensor、保存

Extract the relevant columns, convert them to a Tensor, and save it

In [268]:
user_f = torch.from_numpy(user_feature.iloc[0:2803,[2,3,4,5,6,7,8,9,10,12]].values).to(torch.float)
user_f

tensor([[ 1., 17.,  2.,  ...,  0.,  0.,  0.],
        [ 1.,  4.,  0.,  ...,  0.,  0.,  0.],
        [ 1., 11.,  1.,  ...,  0.,  0.,  0.],
        ...,
        [ 0.,  0.,  0.,  ...,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  ...,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  ...,  0.,  0.,  0.]])

In [271]:
torch.save(user_f, "Tensor/user_f.pt")

In [272]:
user_f.shape

torch.Size([2803, 10])

再次检查，确认Tensor无误

Double check and confirm that the Tensor is correct

In [273]:
user_feature.loc[2790:2802]

Unnamed: 0,user_nIdx,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
2790,2790,邮一棵草莓i,1.0,3.0,3.0,0.0,2.0,48.0,124.0,0.0,0.0,0.0,0.0
2791,2791,獅Yue,0.0,0.0,0.0,0.0,0.0,53.0,1.0,0.0,0.0,0.0,0.0
2792,2792,不知道如何评价,0.0,0.0,0.0,1.0,0.0,55.0,0.0,0.0,0.0,0.0,0.0
2793,2793,EJoachim,1.0,13.0,4.0,0.0,0.0,41.0,27.0,0.0,1.0,0.0,0.0
2794,2794,邱赵琳,0.0,0.0,0.0,1.0,0.0,32.0,49.0,0.0,0.0,0.0,0.0
2795,2795,祎只狸猫,1.0,8.0,0.0,0.0,45.0,149.0,260.0,0.0,1.0,0.0,0.0
2796,2796,名字就叫旧林,0.0,0.0,0.0,0.0,2.0,4.0,4.0,0.0,0.0,0.0,0.0
2797,2797,花月歌浮舟,0.0,0.0,0.0,0.0,60.0,122.0,1.0,0.0,0.0,0.0,0.0
2798,2798,寂月海200007,0.0,0.0,0.0,1.0,0.0,10.0,2.0,0.0,0.0,0.0,0.0
2799,2799,烛虚cron,1.0,5.0,7.0,0.0,1.0,38.0,37.0,0.0,0.0,0.0,0.0


In [274]:
user_f[2790:2802]

tensor([[  1.,   3.,   3.,   0.,   2.,  48., 124.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.,  53.,   1.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   1.,   0.,  55.,   0.,   0.,   0.,   0.],
        [  1.,  13.,   4.,   0.,   0.,  41.,  27.,   0.,   1.,   0.],
        [  0.,   0.,   0.,   1.,   0.,  32.,  49.,   0.,   0.,   0.],
        [  1.,   8.,   0.,   0.,  45., 149., 260.,   0.,   1.,   0.],
        [  0.,   0.,   0.,   0.,   2.,   4.,   4.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,  60., 122.,   1.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   1.,   0.,  10.,   2.,   0.,   0.,   0.],
        [  1.,   5.,   7.,   0.,   1.,  38.,  37.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   1.,   0.,  50.,   6.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   1.,   0.,   7.,   0.,   0.,   0.,   0.]])
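
Note that `user_feature.loc[2790:2802]` is end-inclusive (13 rows) while `user_f[2790:2802]` excludes the end (12 rows), so the two displays above do not cover exactly the same rows. The visual comparison could be complemented by a programmatic check. A minimal sketch on toy data; `toy`, `n_profile`, and `feature_cols` are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd
import torch

# Toy stand-in for `user_feature`; the last row has no profile.
toy = pd.DataFrame({
    "user_nIdx": [0, 1, 2],
    "UserName": ["a", "b", "c"],
    "Description": [1.0, 0.0, np.nan],
    "UserFan": [30.0, 13.0, np.nan],
})
n_profile = 2                                   # number of rows with a profile
feature_cols = ["Description", "UserFan"]

block = toy.loc[:n_profile - 1, feature_cols]   # .loc is end-inclusive
toy_f = torch.from_numpy(block.values).to(torch.float)

# Shape, absence of NaN, and element-wise equality with the source rows.
print(toy_f.shape == (n_profile, len(feature_cols)))
print(not torch.isnan(toy_f).any())
print(np.allclose(toy_f.numpy(), block.to_numpy(dtype="float32")))
```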

在user_nIdx=2803及之后的用户，将被给予一个用于embedding layer的index。<br>
将该index列转换为Tensor并保存

Users with user_nIdx $\in [2803, 3782]$ are given new indices for the subsequent embedding layer <br>
Convert this index list to a Tensor and save it

In [275]:
no_profile_Num = len(user_feature.loc[2803:])
user_embed_idx = torch.arange(no_profile_Num)
user_embed_idx

tensor([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
         42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
         56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,
         70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
         84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
         98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
        112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
        126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
        140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153,
        154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167,
        168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 1

In [277]:
torch.save(user_embed_idx, "Tensor/user_embed_idx.pt")

变量名回顾
- user_f：已有profile的user的features，tensor格式。包含2803个node，shape为2803x10（即10 features）
- user_embed_idx：未知profile的user的index，合计980个。该idx是按照embedding layer的要求建立的（从0~979）。将被送到embedding layer里进行后续处理
- user_feature['user_nIdx']: 可以提取node的unique idx（即user node index），将被用于构建图

Variable name review
- user_f: Tensor; features of the users with profiles; contains 2803 nodes, with shape $2803 \times 10$ (i.e., 10 features)
- user_embed_idx: Tensor; 980 indices for the users without profiles; created to meet the embedding layer's requirement (i.e., indices from 0 to 979); will be fed to the embedding layer
- user_feature['user_nIdx']: a column in the DataFrame; stores all user node indices (from 0 to 3782); will be used to construct the User Interaction Graph
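
How the downstream model combines these two tensors is not shown in this section. As a rough sketch under the assumption that the pseudo profiles have the same width as the real features and are stacked below them, the idea could look like this (all sizes and names are hypothetical toy values):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes: 4 users with profiles, 3 without, 10 features each.
toy_f = torch.rand(4, 10)            # plays the role of `user_f`
toy_embed_idx = torch.arange(3)      # plays the role of `user_embed_idx`

# One learnable row per profile-less user; embedding_dim matches the feature
# width so that both groups can be stacked into one node-feature matrix.
pseudo_profile = nn.Embedding(num_embeddings=3, embedding_dim=10)

# Rows 0..3 are real profiles, rows 4..6 are learned pseudo profiles.
x_user = torch.cat([toy_f, pseudo_profile(toy_embed_idx)], dim=0)
print(x_user.shape)
```

Because users with profiles occupy the first node indices and the remaining users follow, row `i` of the stacked matrix would correspond to `user_nIdx == i` under this assumption.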

### 4.1.5 重组数据集，使得一行只有一个From和一个To/Reorganize the dataset so that there is only one From and one To in a row
创建一个新的DataFrame

Create a new DataFrame

In [278]:
column_names = ['F_UserName','F_nIdx','T_UserName','T_nIdx']
user_inter = pd.DataFrame(columns=column_names)
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx


In [279]:
user_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   F_UserName      3854 non-null   object
 1   T_UserName      1191 non-null   object
 2   Floor_UserName  1186 non-null   object
 3   At_U1           1008 non-null   object
 4   At_U2           56 non-null     object
 5   At_U3           10 non-null     object
 6   At_U4           2 non-null      object
 7   At_U5           1 non-null      object
 8   At_U6           1 non-null      object
 9   At_U7           1 non-null      object
dtypes: object(10)
memory usage: 301.2+ KB


#### (1) 孤立点和Floor点：仅F_UserName非空/Isolated nodes and Floor nodes: only F_UserName is non-null

孤立点的特征在于，在该社群网络里，他们未与其他用户互动；而Floor点在发布作为Floor结构开始的第一条文本时，他/她也没有和其他人互动。<br>
因此，只要除了index=0的列`F_UserName`之外全为空值的点，就可以被归为这2类点
<br>可以直接移步Section 4.1.5 (5)读取数据

Isolated nodes: users who have no interaction with any other user in this community network;<br>
Floor nodes: users who post the first text of a Floor structure; at that moment they have not interacted with anyone either.<br>
Therefore, any record in which every column except `F_UserName` is null belongs to one of these two types<br>
You can skip to Section 4.1.5 (5) to read the dataset

In [280]:
set1 = user_user[user_user.iloc[:, 1:].isna().all(axis=1)]
set1

Unnamed: 0,F_UserName,T_UserName,Floor_UserName,At_U1,At_U2,At_U3,At_U4,At_U5,At_U6,At_U7
0,祎只狸猫,,,,,,,,,
1,邮一棵草莓i,,,,,,,,,
2,kmimg7,,,,,,,,,
3,小祀弟弟吖,,,,,,,,,
4,不知道如何评价,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
3563,哦伊哇伊伊,,,,,,,,,
3571,oO土豆泥O,,,,,,,,,
3737,鸥洗恩,,,,,,,,,
3827,hakumei_surfing_ver,,,,,,,,,


In [281]:
user_inter.loc[:,'F_UserName'] = list(set1['F_UserName'])

In [282]:
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx
0,祎只狸猫,,,
1,邮一棵草莓i,,,
2,kmimg7,,,
3,小祀弟弟吖,,,
4,不知道如何评价,,,
...,...,...,...,...
1658,哦伊哇伊伊,,,
1659,oO土豆泥O,,,
1660,鸥洗恩,,,
1661,hakumei_surfing_ver,,,


#### (2) 其他点：有互动信息/Other nodes: with interactions

在上面的Section里我们已经提取出了所有没有互动信息的记录。那么，只需要对上述部分的索引取反，即可得到所有的“有互动信息”的数据。<br>
我们没有对不同的互动类型（如reply和mention/@）进行细分，因此直接使用melt并做后续处理，即可达成“一行只有一个From和一个To”的目标


All records without interaction information were extracted in the part above. Negating that selection therefore yields all records with interactions <br>
We do not distinguish between interaction types (such as reply and mention/@). Therefore, applying `melt` followed by some additional processing achieves the goal of "only one From and one To in a row"

In [283]:
set1_idx = list(set1.index)
len(set1_idx)

1663

In [284]:
set2 = user_user.loc[~user_user.index.isin(set1_idx)]
len(set2)

2191

In [285]:
set2_inter = set2.melt(id_vars = ['F_UserName'],
                       value_vars = ['T_UserName',
                                     'At_U1','At_U2','At_U3','At_U4','At_U5','At_U6','At_U7'],
                       var_name = 'InterType',
                       value_name = 'InterUser')
set2_inter

Unnamed: 0,F_UserName,InterType,InterUser
0,花月歌浮舟,T_UserName,
1,原神忠粉,T_UserName,
2,DodLIke刂兆,T_UserName,
3,wangxorz,T_UserName,
4,葱花不开花,T_UserName,
...,...,...,...
17523,Aki是乌龟,At_U7,
17524,兔纸今天能摸到鱼吗,At_U7,
17525,兔纸今天能摸到鱼吗,At_U7,
17526,淐馮,At_U7,


In [286]:
set2_inter.dropna(inplace=True)
set2_inter

Unnamed: 0,F_UserName,InterType,InterUser
50,老老老老老那啊,T_UserName,我再也不想写代码了
155,十香Princess2018,T_UserName,糖醋SAO排骨
249,昊哥昊哥耗,T_UserName,花盆栽柳树谁也拦不住
950,寒天黑糖,T_UserName,半透明黑桶
976,bo_白色大月亮,T_UserName,养着兔子的猫咪
...,...,...,...
8803,Pancras_Loe盧鵬州,At_U4,腾讯电竞
10072,本壹,At_U4,拉不拉奇
12263,本壹,At_U5,_啊翔翔翔翔翔
14454,本壹,At_U6,NormanFuckingRockwell


In [287]:
set2_targ = set2_inter[['F_UserName','InterUser']]
user_inter = pd.concat([user_inter, set2_targ]).reset_index()
user_inter

Unnamed: 0,index,F_UserName,F_nIdx,T_UserName,T_nIdx,InterUser
0,0,祎只狸猫,,,,
1,1,邮一棵草莓i,,,,
2,2,kmimg7,,,,
3,3,小祀弟弟吖,,,,
4,4,不知道如何评价,,,,
...,...,...,...,...,...,...
3928,8803,Pancras_Loe盧鵬州,,,,腾讯电竞
3929,10072,本壹,,,,拉不拉奇
3930,12263,本壹,,,,_啊翔翔翔翔翔
3931,14454,本壹,,,,NormanFuckingRockwell


In [288]:
user_inter.drop(columns='index',inplace=True)
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx,InterUser
0,祎只狸猫,,,,
1,邮一棵草莓i,,,,
2,kmimg7,,,,
3,小祀弟弟吖,,,,
4,不知道如何评价,,,,
...,...,...,...,...,...
3928,Pancras_Loe盧鵬州,,,,腾讯电竞
3929,本壹,,,,拉不拉奇
3930,本壹,,,,_啊翔翔翔翔翔
3931,本壹,,,,NormanFuckingRockwell


通过上面的分析，我们知道从DataFrame索引1663开始，就是来自set2的数据：即有用户互动的数据

From the steps above, the records from `set2` (i.e., all data with user interactions) start at DataFrame index 1663

In [289]:
user_inter.loc[1662:1669]

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx,InterUser
1662,兔纸今天能摸到鱼吗,,,,
1663,老老老老老那啊,,,,我再也不想写代码了
1664,十香Princess2018,,,,糖醋SAO排骨
1665,昊哥昊哥耗,,,,花盆栽柳树谁也拦不住
1666,寒天黑糖,,,,半透明黑桶
1667,bo_白色大月亮,,,,养着兔子的猫咪
1668,夕神心音,,,,书一点禾
1669,bo_白色大月亮,,,,罗小黑本喵_


对该部分user：将InterUser的值移动到T_UserName

For these rows, copy the value of InterUser into T_UserName

In [290]:
user_inter.loc[1663:,'T_UserName'] = user_inter.loc[1663:,'InterUser']
user_inter.loc[1662:1669]

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx,InterUser
1662,兔纸今天能摸到鱼吗,,,,
1663,老老老老老那啊,,我再也不想写代码了,,我再也不想写代码了
1664,十香Princess2018,,糖醋SAO排骨,,糖醋SAO排骨
1665,昊哥昊哥耗,,花盆栽柳树谁也拦不住,,花盆栽柳树谁也拦不住
1666,寒天黑糖,,半透明黑桶,,半透明黑桶
1667,bo_白色大月亮,,养着兔子的猫咪,,养着兔子的猫咪
1668,夕神心音,,书一点禾,,书一点禾
1669,bo_白色大月亮,,罗小黑本喵_,,罗小黑本喵_


In [291]:
user_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   F_UserName  3933 non-null   object
 1   F_nIdx      0 non-null      object
 2   T_UserName  2270 non-null   object
 3   T_nIdx      0 non-null      object
 4   InterUser   2270 non-null   object
dtypes: object(5)
memory usage: 153.8+ KB


删去多余的列

Drop the redundant column `InterUser`

In [292]:
user_inter.drop(columns='InterUser',inplace=True)
user_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   F_UserName  3933 non-null   object
 1   F_nIdx      0 non-null      object
 2   T_UserName  2270 non-null   object
 3   T_nIdx      0 non-null      object
dtypes: object(4)
memory usage: 123.0+ KB
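
The steps above depend on the hard-coded offset 1663 and on a temporary `InterUser` column. A possible alternative, sketched on toy data, renames the column before concatenating so that the values land in `T_UserName` directly; `isolated` and `pairs` are hypothetical stand-ins for `set1` and `set2_targ`:

```python
import pandas as pd

# Toy stand-ins (hypothetical values).
isolated = pd.DataFrame({"F_UserName": ["a", "b"]})        # like `set1`
pairs = pd.DataFrame({"F_UserName": ["c", "c"],            # like `set2_targ`
                      "InterUser": ["a", "d"]})

# Renaming first lets concat align on T_UserName, so neither a row offset
# nor a temporary column is required.
inter = pd.concat(
    [isolated, pairs.rename(columns={"InterUser": "T_UserName"})],
    ignore_index=True,
)
print(inter)
```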


#### (3) 将user_nIdx附加到`user_inter`/Attach user_nIdx to the `user_inter`
从Section 4.1.4 (3)的user_feature提取user_nIdx

Extract user_nIdx from user_feature in Section 4.1.4 (3)

In [293]:
user_feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3783 entries, 0 to 3782
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_nIdx       3783 non-null   int64  
 1   UserName        3783 non-null   object 
 2   Description     2803 non-null   float64
 3   DescriptionLen  2803 non-null   float64
 4   SpecialChar     2803 non-null   float64
 5   UserGender      2803 non-null   float64
 6   UserFan         2803 non-null   float64
 7   UserFollow      2803 non-null   float64
 8   UserWeibo       2803 non-null   float64
 9   UserVerified    2803 non-null   float64
 10  LoyalFan        2803 non-null   float64
 11  VipRank         2803 non-null   float64
 12  Vip             2803 non-null   float64
dtypes: float64(11), int64(1), object(1)
memory usage: 384.3+ KB


In [294]:
user_merge = user_feature[['user_nIdx','UserName']]
user_merge

Unnamed: 0,user_nIdx,UserName
0,0,薪火鹏
1,1,提尔乌斯
2,2,猫的摇篮-伪物
3,3,非常神奇的老z
4,4,假装很强的萌新
...,...,...
3778,3778,宋宋Monkey
3779,3779,Shadow-201201
3780,3780,我是你晗大大
3781,3781,普莱米亚姆


In [295]:
user_inter = pd.merge(user_inter, user_merge, how='left', left_on='F_UserName',right_on='UserName')
user_inter = pd.merge(user_inter, user_merge, how='left', left_on='T_UserName',right_on='UserName')
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx,user_nIdx_x,UserName_x,user_nIdx_y,UserName_y
0,祎只狸猫,,,,2795,祎只狸猫,,
1,邮一棵草莓i,,,,2790,邮一棵草莓i,,
2,kmimg7,,,,2789,kmimg7,,
3,小祀弟弟吖,,,,2783,小祀弟弟吖,,
4,不知道如何评价,,,,2792,不知道如何评价,,
...,...,...,...,...,...,...,...,...
3928,Pancras_Loe盧鵬州,,腾讯电竞,,2477,Pancras_Loe盧鵬州,3345.0,腾讯电竞
3929,本壹,,拉不拉奇,,2500,本壹,3284.0,拉不拉奇
3930,本壹,,_啊翔翔翔翔翔,,2500,本壹,3691.0,_啊翔翔翔翔翔
3931,本壹,,NormanFuckingRockwell,,2500,本壹,3211.0,NormanFuckingRockwell


使数据集更加可读：x -> F，y -> T

Make the dataset more readable: x -> F, y -> T

In [296]:
f_nIdx = user_inter['user_nIdx_x']
t_nIdx = user_inter['user_nIdx_y']
user_inter['F_nIdx'] = f_nIdx
user_inter['T_nIdx'] = t_nIdx
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx,user_nIdx_x,UserName_x,user_nIdx_y,UserName_y
0,祎只狸猫,2795,,,2795,祎只狸猫,,
1,邮一棵草莓i,2790,,,2790,邮一棵草莓i,,
2,kmimg7,2789,,,2789,kmimg7,,
3,小祀弟弟吖,2783,,,2783,小祀弟弟吖,,
4,不知道如何评价,2792,,,2792,不知道如何评价,,
...,...,...,...,...,...,...,...,...
3928,Pancras_Loe盧鵬州,2477,腾讯电竞,3345.0,2477,Pancras_Loe盧鵬州,3345.0,腾讯电竞
3929,本壹,2500,拉不拉奇,3284.0,2500,本壹,3284.0,拉不拉奇
3930,本壹,2500,_啊翔翔翔翔翔,3691.0,2500,本壹,3691.0,_啊翔翔翔翔翔
3931,本壹,2500,NormanFuckingRockwell,3211.0,2500,本壹,3211.0,NormanFuckingRockwell


In [297]:
user_inter.drop(columns=['user_nIdx_x','UserName_x','user_nIdx_y','UserName_y'],inplace=True)
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx
0,祎只狸猫,2795,,
1,邮一棵草莓i,2790,,
2,kmimg7,2789,,
3,小祀弟弟吖,2783,,
4,不知道如何评价,2792,,
...,...,...,...,...
3928,Pancras_Loe盧鵬州,2477,腾讯电竞,3345.0
3929,本壹,2500,拉不拉奇,3284.0
3930,本壹,2500,_啊翔翔翔翔翔,3691.0
3931,本壹,2500,NormanFuckingRockwell,3211.0


In [298]:
user_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   F_UserName  3933 non-null   object 
 1   F_nIdx      3933 non-null   int64  
 2   T_UserName  2270 non-null   object 
 3   T_nIdx      2270 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 123.0+ KB


#### (4) 保存user_inter/Save user_inter as a CSV file

In [299]:
interaction = 'Datasets/UserInter_Graph.csv'
user_inter.to_csv(interaction, index=False,encoding='utf-8-sig')

#### (5) 读取数据：user_inter/Read user_inter
user_inter仅包含用户互动信息（用户名和用户节点索引）

user_inter only contains user interaction information (i.e., usernames and user node indices)

In [300]:
user_inter = pd.read_csv('Datasets/UserInter_Graph.csv')
user_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   F_UserName  3933 non-null   object 
 1   F_nIdx      3933 non-null   int64  
 2   T_UserName  2270 non-null   object 
 3   T_nIdx      2270 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 123.0+ KB


In [302]:
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx
0,祎只狸猫,2795,,
1,邮一棵草莓i,2790,,
2,kmimg7,2789,,
3,小祀弟弟吖,2783,,
4,不知道如何评价,2792,,
...,...,...,...,...
3928,Pancras_Loe盧鵬州,2477,腾讯电竞,3345.0
3929,本壹,2500,拉不拉奇,3284.0
3930,本壹,2500,_啊翔翔翔翔翔,3691.0
3931,本壹,2500,NormanFuckingRockwell,3211.0


#### (6) 将用户互动数据集转换为Tensor/Convert the user interaction dataset `user_inter` to Tensor

In [319]:
userInter_data = torch.from_numpy(user_inter[['F_nIdx','T_nIdx']].values).T
userInter_data

tensor([[2795., 2790., 2789.,  ..., 2500., 2500., 2500.],
        [  nan,   nan,   nan,  ..., 3691., 3211., 3661.]], dtype=torch.float64)

In [304]:
userInter_data.shape

torch.Size([2, 3933])

保存

Save

In [310]:
torch.save(userInter_data, "Tensor/userInter_data.pt")

In [311]:
userInter_data

tensor([[2795., 2790., 2789.,  ..., 2500., 2500., 2500.],
        [  nan,   nan,   nan,  ..., 3691., 3211., 3661.]], dtype=torch.float64)
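
The tensor above is `float64` and still holds `nan` for rows without a target. PyTorch Geometric conventionally expects `edge_index` to be an integer (`torch.long`) tensor of shape `[2, num_edges]`. How the later graph-construction code treats the `nan` rows is not shown in this section; as one possible sketch, rows without a target could be dropped before casting (toy data, hypothetical values):

```python
import pandas as pd
import torch

# Toy stand-in for `user_inter`; the first row has no target.
toy = pd.DataFrame({"F_nIdx": [5, 2, 2],
                    "T_nIdx": [float("nan"), 3.0, 0.0]})

# Keep only the rows that describe an actual edge, then cast to integer indices.
edges = toy.dropna(subset=["T_nIdx"]).astype("int64")
edge_index = torch.from_numpy(edges[["F_nIdx", "T_nIdx"]].to_numpy()).T.contiguous()
print(edge_index)
print(edge_index.dtype)
```

Dropping these rows removes no node, because isolated users still appear in the node feature matrix; they simply contribute no edge.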

## 4.2 用户评论所属关系/User-comment affiliation relationship

这一部分包括：
- 给予处理后的文本Comment Idx用于HeteroG的构建
- 将用户的user_nIdx附加到user-comment的数据集
- 将处理好的user-comment数据集转换为tensor
- 对文本使用transformers库转为vector表达；该vector将被用于HeteroG中，comment节点的属性

This Section includes:
- Assign a Comment Idx to every processed text for HeteroG construction
- Attach user_nIdx to the user-comment dataset
- Convert the processed user-comment dataset to a Tensor
- Convert all texts into vector representations via the transformers library; these vectors will be used as the attributes of the comment nodes in HeteroG

### 4.2.1 保存：hetero_data，一个包含了所有的文本与互动信息的数据集/Save hetero_data: a dataset that includes all texts and interaction information

可以移步到Section 4.2.2直接读取数据集

Turn to Section 4.2.2 to read the dataset directly

In [322]:
hetero_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype                    
---  ------        --------------  -----                    
 0   F_Idx         3850 non-null   float64                  
 1   F_CommID      3850 non-null   float64                  
 2   F_Time        3850 non-null   datetime64[ns, UTC+08:00]
 3   F_UserID      3850 non-null   float64                  
 4   F_UserName    3854 non-null   object                   
 5   F_Comment     3854 non-null   object                   
 6   T_Idx         441 non-null    float64                  
 7   T_UserID      907 non-null    float64                  
 8   T_UserName    1191 non-null   object                   
 9   T_CommID      441 non-null    float64                  
 10  T_Time        441 non-null    datetime64[ns, UTC+08:00]
 11  T_Comment     441 non-null    object                   
 12  Floor_Idx     1186 non-null   floa

In [323]:
all_info = 'Datasets/Sina_allInfo.csv'
hetero_data.to_csv(all_info, index=False,encoding='utf-8-sig')

### 4.2.2 读取hetero_data，并提取必要的信息/Read hetero_data and extract the necessary columns

In [324]:
hetero_data = pd.read_csv('Datasets/Sina_allInfo.csv')
hetero_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   F_Idx         3850 non-null   float64
 1   F_CommID      3850 non-null   float64
 2   F_Time        3850 non-null   object 
 3   F_UserID      3850 non-null   float64
 4   F_UserName    3854 non-null   object 
 5   F_Comment     3847 non-null   object 
 6   T_Idx         441 non-null    float64
 7   T_UserID      907 non-null    float64
 8   T_UserName    1191 non-null   object 
 9   T_CommID      441 non-null    float64
 10  T_Time        441 non-null    object 
 11  T_Comment     441 non-null    object 
 12  Floor_Idx     1186 non-null   float64
 13  Floor_CommID  1186 non-null   float64
 14  Floor_UserID  1186 non-null   float64
 15  At_User       1010 non-null   object 
 16  At_U1         1008 non-null   object 
 17  At_U2         56 non-null     object 
 18  At_U3         10 non-null   

user-comment网络体现的是用户及其所发布的文本的所属关系。一个用户可以发布多条文本<br>
因此在构建网络时，只需要用户的辨认信息以及对应的文本内容即可

The user-comment graph captures the affiliation between a user and the texts they posted. A user can post multiple texts<br>
Therefore, only a user identifier (e.g., username or user node index) and the corresponding texts are required to construct the graph

In [328]:
user_comment = hetero_data.loc[:,['F_UserName','F_Comment']]
user_comment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3854 entries, 0 to 3853
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   F_UserName  3854 non-null   object
 1   F_Comment   3847 non-null   object
dtypes: object(2)
memory usage: 60.3+ KB


transformers库中的Chinese-bert无法处理空值，但空格对于该模型tokenization的结果没有任何影响。<br>
因此，我们将空值替换为空格即可。


Chinese BERT in the transformers library cannot handle null values, but spaces do not affect its tokenization results.<br>
Therefore, we replace all null values with a single space.
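The replacement described above can also be written with `fillna`. The following is a minimal sketch on a toy DataFrame; the usernames and texts are made up, and only the column names follow this notebook.

```python
import pandas as pd

# Toy stand-in for user_comment; the usernames and texts are made up.
toy_comments = pd.DataFrame({
    "F_UserName": ["user_a", "user_b", "user_c"],
    "F_Comment": ["考古", None, "加油"],
})

# Replace missing comments with a single space so that every entry is a str.
toy_comments["F_Comment"] = toy_comments["F_Comment"].fillna(" ")

n_missing = int(toy_comments["F_Comment"].isna().sum())
all_str = all(isinstance(text, str) for text in toy_comments["F_Comment"])
print(n_missing, all_str)
```

Checking that every entry is a `str` before tokenization is a cheap way to catch leftover null values early.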

In [329]:
user_comment[user_comment['F_Comment'].isna()]

Unnamed: 0,F_UserName,F_Comment
913,十香Princess2018,
914,糖醋SAO排骨,
1178,心镜D,
1319,不爱发博王左军,
1658,天雷牙皇,
2268,幸运的血玫瑰男爵,
2694,兔兔谈谈,


In [330]:
no_idx = list(user_comment[user_comment['F_Comment'].isna()].index)
user_comment.loc[no_idx,'F_Comment'] = ' '
user_comment[user_comment['F_Comment'].isna()]

Unnamed: 0,F_UserName,F_Comment


### 4.2.3 构建user-comment数据集/Construct user-comment dataset

#### (1) 添加user node index和comment node index/Add the user node index and comment node index

In [331]:
user_feature

Unnamed: 0,user_nIdx,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
0,0,薪火鹏,1.0,17.0,2.0,1.0,30.0,56.0,692.0,0.0,0.0,0.0,0.0
1,1,提尔乌斯,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,猫的摇篮-伪物,1.0,11.0,1.0,1.0,917.0,1109.0,2013.0,0.0,0.0,0.0,0.0
3,3,非常神奇的老z,0.0,0.0,0.0,1.0,13.0,59.0,9.0,0.0,0.0,0.0,0.0
4,4,假装很强的萌新,1.0,10.0,0.0,1.0,14.0,34.0,107.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3778,3778,宋宋Monkey,,,,,,,,,,,
3779,3779,Shadow-201201,,,,,,,,,,,
3780,3780,我是你晗大大,,,,,,,,,,,
3781,3781,普莱米亚姆,,,,,,,,,,,


借助UserName将user_nIdx附加到user_comment上

Attach user_nIdx to the user_comment via UserName
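A left merge silently produces null indices when a username is missing from the user table, and duplicates rows when a username appears twice. The sketch below shows one way to guard against both cases using the `validate` and `indicator` options of `pd.merge`. The toy data (including the unmatched `user_x`) is hypothetical.

```python
import pandas as pd

# Toy stand-ins for user_comment and user_feature; all values are made up.
toy_comments = pd.DataFrame({
    "F_UserName": ["user_a", "user_b", "user_a", "user_x"],
    "F_Comment": ["c0", "c1", "c2", "c3"],
})
toy_users = pd.DataFrame({
    "user_nIdx": [0, 1],
    "UserName": ["user_a", "user_b"],
})

merged = pd.merge(
    toy_comments,
    toy_users,
    how="left",
    left_on="F_UserName",
    right_on="UserName",
    validate="many_to_one",  # raises MergeError if a UserName appears twice
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose username was not found in the user table.
unmatched = merged.loc[merged["_merge"] == "left_only", "F_UserName"].tolist()
print(len(merged), unmatched)
```

If `unmatched` is empty and the row count equals that of the comment table, every comment received exactly one user node index.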

In [332]:
user_nIdx = user_feature[['user_nIdx','UserName']]
user_comment = pd.merge(user_comment, user_nIdx, how='left',left_on = 'F_UserName', right_on = 'UserName' )
user_comment

Unnamed: 0,F_UserName,F_Comment,user_nIdx,UserName
0,祎只狸猫,考古,2795,祎只狸猫
1,邮一棵草莓i,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,2790,邮一棵草莓i
2,kmimg7,真不知道迅哥给评论区投了多少米？[允悲][doge],2789,kmimg7
3,小祀弟弟吖,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],2783,小祀弟弟吖
4,不知道如何评价,[吃瓜]现在是2023年 回来考古的点个赞,2792,不知道如何评价
...,...,...,...,...
3849,Aki是乌龟,你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就不出续作了，我给您磕头了，...,1685,Aki是乌龟
3850,兔纸今天能摸到鱼吗,牛批，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...,2200,兔纸今天能摸到鱼吗
3851,兔纸今天能摸到鱼吗,到我这条底下阴阳怪气啥？觉得抄袭了去评论前面几条啊，去和米卫兵吵啊，我又没发什么过激言论。再...,2200,兔纸今天能摸到鱼吗
3852,淐馮,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,2435,淐馮


In [333]:
user_idx = user_comment['user_nIdx']
user_comment = user_comment.drop(columns=['user_nIdx','UserName'],axis=1)
user_comment.insert(1,'User_nIdx',user_idx)
user_comment

Unnamed: 0,F_UserName,User_nIdx,F_Comment
0,祎只狸猫,2795,考古
1,邮一棵草莓i,2790,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难
2,kmimg7,2789,真不知道迅哥给评论区投了多少米？[允悲][doge]
3,小祀弟弟吖,2783,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪]
4,不知道如何评价,2792,[吃瓜]现在是2023年 回来考古的点个赞
...,...,...,...
3849,Aki是乌龟,1685,你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就不出续作了，我给您磕头了，...
3850,兔纸今天能摸到鱼吗,2200,牛批，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...
3851,兔纸今天能摸到鱼吗,2200,到我这条底下阴阳怪气啥？觉得抄袭了去评论前面几条啊，去和米卫兵吵啊，我又没发什么过激言论。再...
3852,淐馮,2435,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...


给每一条文本设置索引作为comment node index（Comment_nIdx）

Set the comment node index (Comment_nIdx) for each text

In [334]:
user_comment['Comment_nIdx'] = list(user_comment.index)
user_comment

Unnamed: 0,F_UserName,User_nIdx,F_Comment,Comment_nIdx
0,祎只狸猫,2795,考古,0
1,邮一棵草莓i,2790,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,1
2,kmimg7,2789,真不知道迅哥给评论区投了多少米？[允悲][doge],2
3,小祀弟弟吖,2783,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],3
4,不知道如何评价,2792,[吃瓜]现在是2023年 回来考古的点个赞,4
...,...,...,...,...
3849,Aki是乌龟,1685,你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就不出续作了，我给您磕头了，...,3849
3850,兔纸今天能摸到鱼吗,2200,牛批，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...,3850
3851,兔纸今天能摸到鱼吗,2200,到我这条底下阴阳怪气啥？觉得抄袭了去评论前面几条啊，去和米卫兵吵啊，我又没发什么过激言论。再...,3851
3852,淐馮,2435,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,3852


#### (2) 保存user_comment/Save user_comment

可以跳转至下一个Section（Section 4.2.3 (3)）直接读取数据

You can turn to Section 4.2.3 (3) to read the dataset directly

In [335]:
upc = 'Datasets/UserPost.csv'
user_comment.to_csv(upc, index=False,encoding='utf-8-sig')

#### (3) 读取user_comment：用户与文本对应关系/Read user_comment: the correspondence between users and texts 

In [337]:
user_comment = pd.read_csv('Datasets/UserPost.csv')

In [338]:
user_comment

Unnamed: 0,F_UserName,User_nIdx,F_Comment,Comment_nIdx
0,祎只狸猫,2795,考古,0
1,邮一棵草莓i,2790,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,1
2,kmimg7,2789,真不知道迅哥给评论区投了多少米？[允悲][doge],2
3,小祀弟弟吖,2783,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],3
4,不知道如何评价,2792,[吃瓜]现在是2023年 回来考古的点个赞,4
...,...,...,...,...
3849,Aki是乌龟,1685,你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就不出续作了，我给您磕头了，...,3849
3850,兔纸今天能摸到鱼吗,2200,牛批，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...,3850
3851,兔纸今天能摸到鱼吗,2200,到我这条底下阴阳怪气啥？觉得抄袭了去评论前面几条啊，去和米卫兵吵啊，我又没发什么过激言论。再...,3851
3852,淐馮,2435,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,3852


#### (4) 提取user-comment的索引，并转换成Tensor/Extract the user-comment index pair, and convert it to Tensor

userPost_comm将被用来构建HeteroG中的user-comment图

userPost_comm will be used to construct the user-comment graph in HeteroG
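As a sketch of the conversion, the index pairs can be turned into an edge index tensor of shape `[2, num_edges]`. The toy values are made up. Calling `.contiguous()` after the transpose is a cheap safeguard assumed here, since a transposed tensor is only a view and some graph operations expect contiguous memory.

```python
import pandas as pd
import torch

# Toy stand-in for user_comment; the index values are made up.
toy = pd.DataFrame({
    "User_nIdx": [2, 0, 2, 1],
    "Comment_nIdx": [0, 1, 2, 3],
})

# Copy to a writable int64 array, then transpose to [2, num_edges]:
# row 0 holds source (user) indices, row 1 holds target (comment) indices.
pairs = toy[["User_nIdx", "Comment_nIdx"]].to_numpy(dtype="int64", copy=True)
edge_index = torch.from_numpy(pairs).t().contiguous()

print(tuple(edge_index.shape), edge_index.dtype, edge_index.is_contiguous())
```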

In [339]:
userPost_comm = torch.from_numpy(user_comment[['User_nIdx','Comment_nIdx']].values).T
userPost_comm

tensor([[2795, 2790, 2789,  ..., 2200, 2435, 2677],
        [   0,    1,    2,  ..., 3851, 3852, 3853]])

保存并读取

Save and read

In [340]:
torch.save(userPost_comm, "Tensor/userPost_comm.pt")

In [341]:
userPost_comm = torch.load("Tensor/userPost_comm.pt")
userPost_comm

tensor([[2795, 2790, 2789,  ..., 2200, 2435, 2677],
        [   0,    1,    2,  ..., 3851, 3852, 3853]])

### 4.2.4 使用transformers库的预训练模型处理文本数据/Process textual data with a pre-trained model from the transformers library

#### (1) 提取文本列/Extract the text column F_Comment

In [343]:
user_comment

Unnamed: 0,F_UserName,User_nIdx,F_Comment,Comment_nIdx
0,祎只狸猫,2795,考古,0
1,邮一棵草莓i,2790,来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难,1
2,kmimg7,2789,真不知道迅哥给评论区投了多少米？[允悲][doge],2
3,小祀弟弟吖,2783,入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪],3
4,不知道如何评价,2792,[吃瓜]现在是2023年 回来考古的点个赞,4
...,...,...,...,...
3849,Aki是乌龟,1685,你这可把爷整乐了，求求你了，一定要玩，不玩任天堂就倒闭了，塞尔达就不出续作了，我给您磕头了，...,3849
3850,兔纸今天能摸到鱼吗,2200,牛批，哪怕最后原神实锤抄袭我不玩我也绝对不会去玩塞尔达呢，你们这么烦人的样子确实让我把游戏加...,3850
3851,兔纸今天能摸到鱼吗,2200,到我这条底下阴阳怪气啥？觉得抄袭了去评论前面几条啊，去和米卫兵吵啊，我又没发什么过激言论。再...,3851
3852,淐馮,2435,唉嘿，我觉得在微博里逛逛，然后投诉一下那些素质低下的人还挺有趣的，只是不知道投诉有没有用，毕...,3852


In [344]:
comment_list = list(user_comment['F_Comment'])
comment_list

['考古',
 '来考古了，怎么说，知道以前原神舆论正义很大，但亲眼看到前面的评论才能有点真实感觉到当时有多难',
 '真不知道迅哥给评论区投了多少米？[允悲][doge]',
 '入坑很晚想来看看之前的pv，这么多骂抄袭的吗？一开始这么难吗[泪]',
 '[吃瓜]现在是2023年 回来考古的点个赞',
 '考古',
 '就搞不懂了，怎么有些人这么愤世嫉俗[doge]那他平日得有多大成就啊[泪]',
 '考古结束 ',
 'Windows 11“狗都不用”时代前来考古[doge]',
 '卧槽，这时候就有绫华了！？',
 '我就是米卫兵，原神没抄袭',
 '回来看看梦开始的地方',
 '3年了 3年',
 '考古',
 '参观古战场[666]',
 '神里不愧是亲女儿[允悲]回来看看就热评打脸现场',
 '底下骂抄袭的，我来扫墓啦[打call][打call]',
 '考古[鲜花]',
 '继续考古，温迪！万叶！你们快来吧！',
 '前来考古，2022.3.5，原神加油，米哈游加油！',
 '我惊了，这么早就把神里绫华放出来了？[二哈][二哈]',
 '希望原神越来越好',
 '那些热评的没妈仔怎么还在啊[doge]',
 '评论区真有趣[doge]',
 '原来神里大小姐这么早就出现在pv里了',
 '加油[喵喵]',
 '2021年8月13日报道，我想说原神真的很好玩，会做的越来越好的，加油！',
 '回头来看，热评哈哈哈哈哈哈哈哈哈',
 '加油(ง •̀_•́)ง！！！',
 '求大伟哥给个资格',
 '作为米哈游科技（上海）有限公司制作发行的一款开放世界冒险游戏，画面风格精致富有美感，是一款自由度很高的手游大作，我真心想获得本次测试，也会抓住每一个机会，我会努力争取的！[太开心]  旅行者，安柏，凯亚，琴，丽莎，芭芭拉，到时候一定与你们结友！共度时光  加油！#原神#',
 '可恶的米忽悠，出这么个游戏，估计又要换一部手机',
 '等私信  ',
 '  来个测试资格吧[二哈]',
 '明显的有人带节奏...守望先锋被抄没人闹，吃鸡基本全民的游戏网易抄袭也不见有人当正义使者，然后我睡一觉起来全中国就我一个人没玩过塞尔达了（转载的）',
 '  嘿嘿，栗子大大出来装个好友帮个忙[太开心][太开心]',
 '原神登录NS了，要不再来个暴躁大哥砸个NS助助兴?[d

确认可用设备：CPU/GPU （请根据自己的设备修改）

Confirm available devices: CPU/GPU (please modify according to your own device)

In [345]:
gpu = "cuda:0" if torch.cuda.is_available() else "cpu"
gpu

'cuda:0'

后续步骤包括：
- 确认要使用的transformers库中预训练的模型（该数据集中的文本均为中文，因此选择bert-base-chinese）
- 将comment_list作为tokenizer的输入
- 将token后的文本输入预训练模型里，得到所有文本的向量表达

The following steps include:
- Choose which pre-trained model from the transformers library to use (all texts in this dataset are Chinese, so bert-base-chinese is appropriate)
- Feed comment_list into the selected model's tokenizer
- Feed the tokenized texts into the selected model to obtain the vector representations of all texts

#### (2) 使用选定模型及其tokenizer处理文本数据/Use the selected model and its tokenizer to handle textual data

<font color="red">注意：下面的代码块运行后在我的电脑上出现了来自jupyter-widgets的bug，但该bug不影响后续HeteroG的构建、bert-base-chinese的使用、以及提出的模型的训练</font>

<font color="yellow">Note: the code cell below triggered some jupyter-widgets errors on my PC, but they do not affect the subsequent steps, including HeteroG construction, the use of bert-base-chinese, and the training of our proposed model</font>

In [346]:
torch.manual_seed(0)
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
bert_model = BertModel.from_pretrained('bert-base-chinese')

'''移动到cuda上/Put the model and input to cuda'''
bert_model = bert_model.to(gpu)
bert_inputs = bert_tokenizer(comment_list,return_tensors="pt",padding=True).to(gpu)


为避免显存溢出，将bert_inputs分成多个small batches后、再逐一送入模型（bert-base-chinese）里

To avoid running out of GPU memory, `bert_inputs` is split into small batches that are fed into the model (bert-base-chinese) one by one.
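One possible way to do the splitting is `torch.split`, which keeps the final, smaller chunk automatically. The helper name `split_encoding` and the toy tensors below are hypothetical; only the three dictionary keys and the row count (3854) follow this notebook.

```python
import torch

def split_encoding(encoding, batch_size):
    """Split every tensor in a tokenizer-style dict into aligned mini-batches."""
    keys = list(encoding)
    # torch.split keeps the final, smaller chunk, so no tail handling is needed.
    chunks = [torch.split(encoding[k], batch_size, dim=0) for k in keys]
    return [dict(zip(keys, parts)) for parts in zip(*chunks)]

# Toy stand-in for bert_inputs: 3854 rows, sequence length 8, placeholder values.
toy_encoding = {
    "input_ids": torch.zeros(3854, 8, dtype=torch.long),
    "token_type_ids": torch.zeros(3854, 8, dtype=torch.long),
    "attention_mask": torch.ones(3854, 8, dtype=torch.long),
}
batches = split_encoding(toy_encoding, batch_size=16)
print(len(batches), batches[-1]["input_ids"].shape[0])
```

With 3854 rows and a batch size of 16, this gives 241 batches, the last of which holds the remaining 14 rows.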

In [348]:
bert_inputs

{'input_ids': tensor([[ 101, 5440, 1367,  ...,    0,    0,    0],
        [ 101, 3341, 5440,  ...,    0,    0,    0],
        [ 101, 4696,  679,  ...,    0,    0,    0],
        ...,
        [ 101, 1168, 2769,  ...,    0,    0,    0],
        [ 101, 1536, 1678,  ...,    0,    0,    0],
        [ 101, 1744,  772,  ...,    0,    0,    0]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}

In [349]:
bert_ids = bert_inputs['input_ids']
bert_mask = bert_inputs['attention_mask']
bert_Tids = bert_inputs['token_type_ids']

# 定义一个函数来分批/Define a function to split the inputs into batches
def batch_collate(batch_size,ids,mask,tIds):
    # 根据数据长度分批，最后一批可能不足batch_size
    # Derive the batches from the data length; the last batch may be smaller
    batch_list = []
    for i in range(0, ids.shape[0], batch_size):
        batch_dict = {
            'input_ids': ids[i : i+batch_size],
            'attention_mask': mask[i : i+batch_size],
            'token_type_ids': tIds[i : i+batch_size]
        }
        batch_list.append(batch_dict)
    return batch_list

In [350]:
# 分批/Allocate the batch
batch_set = batch_collate(16,bert_ids,bert_mask,bert_Tids)

In [351]:
len(batch_set)

241

In [352]:
# 逐批送入模型里/Feed each batch one by one into the model
torch.manual_seed(0)
all_commentV = []
with torch.no_grad():
    for batch in batch_set:
        batch_output = bert_model(**batch)
        # 取pooler_output作为文本向量/Use pooler_output as the text vector
        batch_vector = batch_output.pooler_output
        all_commentV.append(batch_vector)

In [353]:
len(all_commentV)

241

In [354]:
all_commentV

[tensor([[ 0.9940,  0.9999,  0.9899,  ..., -0.9991, -0.8844,  0.9776],
         [ 0.9999,  1.0000,  0.9957,  ..., -0.9897, -0.9999,  0.9136],
         [ 0.9979,  0.9972,  0.9994,  ..., -0.9849, -0.9998, -0.9075],
         ...,
         [ 0.9940,  0.9999,  0.9899,  ..., -0.9991, -0.8844,  0.9776],
         [ 0.9987,  0.9998,  0.9924,  ..., -0.9989, -0.9774,  0.7461],
         [ 0.9998,  1.0000,  0.9998,  ..., -0.9989, -0.9998,  0.8627]],
        device='cuda:0'),
 tensor([[ 0.9999,  1.0000,  0.9999,  ..., -0.9980, -0.9996, -0.8641],
         [ 0.9997,  0.9999,  0.9992,  ..., -0.9995, -0.9926,  0.7948],
         [ 0.9999,  0.9999,  0.9995,  ..., -0.9962, -0.9987,  0.7980],
         ...,
         [ 0.9982,  0.9999,  0.9998,  ..., -0.9928, -0.9925,  0.8479],
         [ 0.9993,  0.9992,  0.9948,  ..., -0.9950, -0.9997,  0.3598],
         [ 0.9999,  1.0000,  0.9997,  ..., -0.9969, -0.9997,  0.9621]],
        device='cuda:0'),
 tensor([[ 0.9975,  0.9993,  0.9997,  ..., -0.9984, -0.9681,  0.91

In [355]:
comment_vector = torch.cat(all_commentV,dim=0)
comment_vCpu = comment_vector.to('cpu')
comment_vCpu

tensor([[ 0.9940,  0.9999,  0.9899,  ..., -0.9991, -0.8844,  0.9776],
        [ 0.9999,  1.0000,  0.9957,  ..., -0.9897, -0.9999,  0.9136],
        [ 0.9979,  0.9972,  0.9994,  ..., -0.9849, -0.9998, -0.9075],
        ...,
        [ 0.9989,  1.0000,  0.9834,  ..., -0.9420, -0.9996,  0.1584],
        [ 0.9998,  1.0000,  0.9997,  ..., -0.9540, -0.9999, -0.4217],
        [ 0.9996,  0.9997,  0.9980,  ..., -0.9992, -0.9944, -0.1119]])

存储文本向量表达，清空GPU内存

Save the vector representations of the textual data and free the cached GPU memory
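The sketch below shows a save and reload round trip on placeholder data (random values, not real embeddings), written to a temporary directory instead of the notebook's `Tensor/` folder. Note that `torch.cuda.empty_cache()` only releases cached blocks that no live tensor occupies, so a model and inputs that are still referenced will likely keep their GPU memory.

```python
import os
import tempfile
import torch

# Placeholder for the concatenated pooler outputs (random, not real embeddings).
toy_vectors = torch.randn(5, 768)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Comment_Vector.pt")
    # Detach and move to CPU first so the file loads on machines without a GPU.
    torch.save(toy_vectors.detach().cpu(), path)
    restored = torch.load(path, map_location="cpu")

# Only relevant on a CUDA machine; skipped otherwise.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print(torch.equal(toy_vectors, restored))
```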

In [356]:
torch.cuda.empty_cache()

In [357]:
torch.save(comment_vCpu, "Tensor/Comment_Vector.pt")

### 4.2.5 读取向量表征/Read vector representation `Comment_Vector`

In [358]:
comm_f = torch.load("Tensor/Comment_Vector.pt")
comm_f

tensor([[ 0.9940,  0.9999,  0.9899,  ..., -0.9991, -0.8844,  0.9776],
        [ 0.9999,  1.0000,  0.9957,  ..., -0.9897, -0.9999,  0.9136],
        [ 0.9979,  0.9972,  0.9994,  ..., -0.9849, -0.9998, -0.9075],
        ...,
        [ 0.9989,  1.0000,  0.9834,  ..., -0.9420, -0.9996,  0.1584],
        [ 0.9998,  1.0000,  0.9997,  ..., -0.9540, -0.9999, -0.4217],
        [ 0.9996,  0.9997,  0.9980,  ..., -0.9992, -0.9944, -0.1119]])

In [360]:
print(f"Allocated GPU: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Reserved GPU: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

Allocated GPU: 0.43 GB
Reserved GPU: 0.48 GB


# 5. 准备我们提出的HG-PD模型的输入：异构图和标准化的用户特征/Prepare the input data for our proposed model HG-PD: HeteroG construction and standardized user features

在这个部分我们将
- 构建异构图，包含2种节点（用户 与 评论）和2种边（有向的用户互动 与 无向的用户评论所属）；其中评论节点属性为经过bert-base-chinese得到的文本向量表达
- 对用户特征进行标准化

In this section we will
- Construct the HeteroG, which includes 2 types of nodes (user and comment) and 2 types of edges (directed user interaction and undirected user-comment affiliation). The comment nodes' attributes are the text vector representations obtained from bert-base-chinese
- Standardize the user features

## 5.1 异构图/HeteroG (Heterogeneous graph) construction

### 5.1.1 孤立点：自连接/Isolated user node: self connection

为了确保孤立点也能被GNNs正确的处理，为所有的孤立点加上自连接

To ensure that GNNs can handle isolated user nodes correctly, a self-connection (self-loop) is added to each of them.
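A compact way to express this step is to fill every missing target index with the source index of the same row. This is a minimal sketch on made-up indices; only the column names `F_nIdx` and `T_nIdx` follow this notebook.

```python
import pandas as pd

# Toy stand-in for user_inter; a missing T_nIdx means "no target user".
toy_inter = pd.DataFrame({
    "F_nIdx": [5, 7, 3],
    "T_nIdx": [None, 2.0, None],
})

# Where no target user exists, point the edge back at the source user (self-loop),
# then cast to int because node indices must be integers.
toy_inter["T_nIdx"] = toy_inter["T_nIdx"].fillna(toy_inter["F_nIdx"]).astype(int)
print(toy_inter["T_nIdx"].tolist())
```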

In [361]:
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx
0,祎只狸猫,2795,,
1,邮一棵草莓i,2790,,
2,kmimg7,2789,,
3,小祀弟弟吖,2783,,
4,不知道如何评价,2792,,
...,...,...,...,...
3928,Pancras_Loe盧鵬州,2477,腾讯电竞,3345.0
3929,本壹,2500,拉不拉奇,3284.0
3930,本壹,2500,_啊翔翔翔翔翔,3691.0
3931,本壹,2500,NormanFuckingRockwell,3211.0


In [362]:
self_idx = list(user_inter[user_inter['T_nIdx'].isna()].index)
user_inter.loc[self_idx,'T_nIdx'] = user_inter.loc[self_idx,'F_nIdx']
user_inter.loc[self_idx]

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx
0,祎只狸猫,2795,,2795.0
1,邮一棵草莓i,2790,,2790.0
2,kmimg7,2789,,2789.0
3,小祀弟弟吖,2783,,2783.0
4,不知道如何评价,2792,,2792.0
...,...,...,...,...
1658,哦伊哇伊伊,1549,,1549.0
1659,oO土豆泥O,2409,,2409.0
1660,鸥洗恩,379,,379.0
1661,hakumei_surfing_ver,796,,796.0


In [363]:
user_inter['T_nIdx'] = user_inter['T_nIdx'].astype(int)
user_inter

Unnamed: 0,F_UserName,F_nIdx,T_UserName,T_nIdx
0,祎只狸猫,2795,,2795
1,邮一棵草莓i,2790,,2790
2,kmimg7,2789,,2789
3,小祀弟弟吖,2783,,2783
4,不知道如何评价,2792,,2792
...,...,...,...,...
3928,Pancras_Loe盧鵬州,2477,腾讯电竞,3345
3929,本壹,2500,拉不拉奇,3284
3930,本壹,2500,_啊翔翔翔翔翔,3691
3931,本壹,2500,NormanFuckingRockwell,3211


处理完毕，更新用户互动Tensor

The self-connections are done; update the user interaction Tensor as userInter_self

In [364]:
userInter_self = torch.from_numpy(user_inter[['F_nIdx','T_nIdx']].values).T
userInter_self

tensor([[2795, 2790, 2789,  ..., 2500, 2500, 2500],
        [2795, 2790, 2789,  ..., 3691, 3211, 3661]])

保存

Save

In [365]:
torch.save(userInter_self, "Tensor/userInter_self.pt")

### 5.1.2 HeteroData与异构图构建/HeteroData and HeteroG construction

user-comment：从一开始的有向图做成无向图；加一个反向图，即comment-user即可

user-comment: turn the initially directed graph into an undirected one by adding a reverse graph (i.e., comment-user)
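The reverse edges can be sketched on a toy edge index as follows: flipping the tensor along dimension 0 swaps the source and target rows, which is equivalent to stacking the two rows in reverse order. The index values are made up. PyTorch Geometric also ships a `ToUndirected` transform, which is not used in this sketch.

```python
import torch

# Toy user -> comment edges: row 0 = user indices, row 1 = comment indices.
user_to_comment = torch.tensor([[2, 0, 2, 1],
                                [0, 1, 2, 3]])

# Flipping along dim 0 swaps the source and target rows.
comment_to_user = torch.flip(user_to_comment, dims=[0])

same_as_vstack = torch.equal(
    comment_to_user,
    torch.vstack((user_to_comment[1], user_to_comment[0])),
)
print(comment_to_user.tolist(), same_as_vstack)
```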

In [368]:
user_Node = torch.from_numpy(user_feature['user_nIdx'].values).T
user_Node

tensor([   0,    1,    2,  ..., 3780, 3781, 3782])

In [369]:
comm_Node = torch.from_numpy(user_comment['Comment_nIdx'].values).T
comm_Node

tensor([   0,    1,    2,  ..., 3851, 3852, 3853])

In [370]:
uc0 = userPost_comm[0]
uc1 = userPost_comm[1]

In [371]:
commFrom_user = torch.vstack((uc1, uc0))
commFrom_user

tensor([[   0,    1,    2,  ..., 3851, 3852, 3853],
        [2795, 2790, 2789,  ..., 2200, 2435, 2677]])

In [372]:
userPost_comm

tensor([[2795, 2790, 2789,  ..., 2200, 2435, 2677],
        [   0,    1,    2,  ..., 3851, 3852, 3853]])

In [373]:
heteroData = HeteroData()
heteroData['user'].node_id = user_Node

heteroData['comment'].node_id = comm_Node
heteroData['comment'].x = comm_f

heteroData['user','interact','user'].edge_index = userInter_self
heteroData['user','post','comment'].edge_index = userPost_comm
heteroData['comment','from','user'].edge_index = commFrom_user

heteroData

HeteroData(
  user={ node_id=[3783] },
  comment={
    node_id=[3854],
    x=[3854, 768]
  },
  (user, interact, user)={ edge_index=[2, 3933] },
  (user, post, comment)={ edge_index=[2, 3854] },
  (comment, from, user)={ edge_index=[2, 3854] }
)

可以查看heteroData的相关信息
- x_dict: 节点属性。目前存储了comment节点的信息；用户节点将在模型部分传入
- edge_index_dict：指示了node对之间的关系，即边

You can inspect the contents of heteroData
- x_dict: node attributes. The comment nodes' attributes are already stored in heteroData, while the user nodes' features will be passed in the model section
- edge_index_dict: the edges, i.e., the relations between pairs of nodes.
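Before training, it can be worth checking that every index in an edge tensor points at an existing node. The helper `edge_index_in_range` below is a hypothetical sanity check on toy data, not part of PyTorch Geometric; it assumes a non-empty edge index of shape `[2, num_edges]`.

```python
import torch

def edge_index_in_range(edge_index, num_src, num_dst):
    """Return True if all source/target indices are valid node positions."""
    src, dst = edge_index
    return bool(
        src.min() >= 0 and dst.min() >= 0
        and src.max() < num_src and dst.max() < num_dst
    )

# Toy user -> comment edges: 3 users (0..2) and 4 comments (0..3).
toy_edges = torch.tensor([[2, 0, 2, 1],
                          [0, 1, 2, 3]])

valid = edge_index_in_range(toy_edges, num_src=3, num_dst=4)
too_few_users = edge_index_in_range(toy_edges, num_src=2, num_dst=4)
print(valid, too_few_users)
```

For the graph built in this section, the node counts would be 3783 users and 3854 comments, as reported by the `HeteroData` summary.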

In [374]:
heteroData.x_dict

{'comment': tensor([[ 0.9940,  0.9999,  0.9899,  ..., -0.9991, -0.8844,  0.9776],
         [ 0.9999,  1.0000,  0.9957,  ..., -0.9897, -0.9999,  0.9136],
         [ 0.9979,  0.9972,  0.9994,  ..., -0.9849, -0.9998, -0.9075],
         ...,
         [ 0.9989,  1.0000,  0.9834,  ..., -0.9420, -0.9996,  0.1584],
         [ 0.9998,  1.0000,  0.9997,  ..., -0.9540, -0.9999, -0.4217],
         [ 0.9996,  0.9997,  0.9980,  ..., -0.9992, -0.9944, -0.1119]])}

In [375]:
heteroData.edge_index_dict

{('user',
  'interact',
  'user'): tensor([[2795, 2790, 2789,  ..., 2500, 2500, 2500],
         [2795, 2790, 2789,  ..., 3691, 3211, 3661]]),
 ('user',
  'post',
  'comment'): tensor([[2795, 2790, 2789,  ..., 2200, 2435, 2677],
         [   0,    1,    2,  ..., 3851, 3852, 3853]]),
 ('comment',
  'from',
  'user'): tensor([[   0,    1,    2,  ..., 3851, 3852, 3853],
         [2795, 2790, 2789,  ..., 2200, 2435, 2677]])}

In [376]:
userInter_self

tensor([[2795, 2790, 2789,  ..., 2500, 2500, 2500],
        [2795, 2790, 2789,  ..., 3691, 3211, 3661]])

保存

Save

In [377]:
torch.save(heteroData,'Tensor/HeteroData.pt')

## 5.2 将用户特征标准化/Standardize user features

我们需要将前面得到的用户特征（tensor）中、数字型变量标准化

We need to standardize the numeric variables in the user feature tensor (`user_f`) obtained previously

In [391]:
user_f

tensor([[ 1., 17.,  2.,  ...,  0.,  0.,  0.],
        [ 1.,  4.,  0.,  ...,  0.,  0.,  0.],
        [ 1., 11.,  1.,  ...,  0.,  0.,  0.],
        ...,
        [ 0.,  0.,  0.,  ...,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  ...,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  ...,  0.,  0.,  0.]])

In [392]:
user_feature

Unnamed: 0,user_nIdx,UserName,Description,DescriptionLen,SpecialChar,UserGender,UserFan,UserFollow,UserWeibo,UserVerified,LoyalFan,VipRank,Vip
0,0,薪火鹏,1.0,17.0,2.0,1.0,30.0,56.0,692.0,0.0,0.0,0.0,0.0
1,1,提尔乌斯,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,猫的摇篮-伪物,1.0,11.0,1.0,1.0,917.0,1109.0,2013.0,0.0,0.0,0.0,0.0
3,3,非常神奇的老z,0.0,0.0,0.0,1.0,13.0,59.0,9.0,0.0,0.0,0.0,0.0
4,4,假装很强的萌新,1.0,10.0,0.0,1.0,14.0,34.0,107.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3778,3778,宋宋Monkey,,,,,,,,,,,
3779,3779,Shadow-201201,,,,,,,,,,,
3780,3780,我是你晗大大,,,,,,,,,,,
3781,3781,普莱米亚姆,,,,,,,,,,,


In [393]:
user_feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3783 entries, 0 to 3782
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_nIdx       3783 non-null   int64  
 1   UserName        3783 non-null   object 
 2   Description     2803 non-null   float64
 3   DescriptionLen  2803 non-null   float64
 4   SpecialChar     2803 non-null   float64
 5   UserGender      2803 non-null   float64
 6   UserFan         2803 non-null   float64
 7   UserFollow      2803 non-null   float64
 8   UserWeibo       2803 non-null   float64
 9   UserVerified    2803 non-null   float64
 10  LoyalFan        2803 non-null   float64
 11  VipRank         2803 non-null   float64
 12  Vip             2803 non-null   float64
dtypes: float64(11), int64(1), object(1)
memory usage: 384.3+ KB


借助原数据集`user_feature`来确认数字型变量在`user_f`中的对应列索引。该对应关系如下所示（左边来自`user_feature`的列索引，右边则为`user_f`）

Use the original dataset `user_feature` to identify the column indices of the numeric variables in `user_f`. The correspondence is shown below (left: column index in `user_feature`; right: column index in `user_f`)


user_feature vs. user_f
- 2 - 0
- 3 - 1
- 4 - 2
- 5 - 3
- 6 - 4
- 7 - 5
- 8 - 6
- 9 - 7
- 10 - 8
- 12 - 9
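The indices can also be derived from column names instead of being counted by hand. The sketch below assumes that `user_f` keeps the column order of `user_feature` with `user_nIdx`, `UserName`, and `VipRank` removed (inferred from the mapping above, where column 12 maps to 9), and that the five count-like columns are the ones to standardize.

```python
# Column order of user_feature as printed by user_feature.info().
feature_columns = [
    "user_nIdx", "UserName", "Description", "DescriptionLen", "SpecialChar",
    "UserGender", "UserFan", "UserFollow", "UserWeibo", "UserVerified",
    "LoyalFan", "VipRank", "Vip",
]

# Assumption: these three columns are not part of user_f.
dropped = {"user_nIdx", "UserName", "VipRank"}
tensor_columns = [c for c in feature_columns if c not in dropped]

# Assumption: these are the count-like variables to standardize.
numeric = ["DescriptionLen", "SpecialChar", "UserFan", "UserFollow", "UserWeibo"]
stand_col = [tensor_columns.index(c) for c in numeric]
print(stand_col)
```

Under these assumptions the result is `[1, 2, 4, 5, 6]`, the same column indices used for standardization in this section.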

In [394]:
stand_col = [1, 2, 4, 5, 6]
user_stand = user_f[:, stand_col]
user_stand

tensor([[1.7000e+01, 2.0000e+00, 3.0000e+01, 5.6000e+01, 6.9200e+02],
        [4.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.1000e+01, 1.0000e+00, 9.1700e+02, 1.1090e+03, 2.0130e+03],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 5.0000e+01, 6.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 7.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 6.1000e+01, 0.0000e+00]])

标准化

Standardization
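A minimal sketch of column-wise z-scoring, $z = (x - \mu) / \sigma$, on a toy tensor. The helper `standardize_columns` and the `eps` guard are additions for illustration: a constant column has zero standard deviation, and dividing by it would produce NaN values.

```python
import torch

def standardize_columns(features, columns, eps=1e-8):
    """Z-score the selected columns; other columns are returned unchanged."""
    out = features.clone()
    selected = features[:, columns]
    mean = selected.mean(dim=0)
    # torch.std uses the unbiased (n - 1) estimator by default.
    std = selected.std(dim=0)
    # clamp_min avoids division by zero for constant columns.
    out[:, columns] = (selected - mean) / std.clamp_min(eps)
    return out

# Toy features: column 0 is binary, column 1 is numeric, column 2 is constant.
toy = torch.tensor([[1.0, 10.0, 0.0],
                    [0.0, 20.0, 0.0],
                    [1.0, 30.0, 0.0]])
scaled = standardize_columns(toy, columns=[1, 2])
print(scaled)
```

Binary indicator columns are left untouched here, mirroring the choice to standardize only a subset of the feature columns.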

In [395]:
user_s_mean = user_stand.mean(dim=0)
user_s_std = user_stand.std(dim=0)
stand_user_f = (user_stand - user_s_mean)/user_s_std
stand_user_f

tensor([[ 0.9457,  0.3182, -0.0239, -0.4984, -0.2206],
        [-0.2962, -0.4198, -0.0240, -0.5650, -0.2996],
        [ 0.3725, -0.0508, -0.0209,  0.7533, -0.0697],
        ...,
        [-0.6783, -0.4198, -0.0240, -0.5055, -0.2989],
        [-0.6783, -0.4198, -0.0240, -0.5567, -0.2996],
        [-0.6783, -0.4198, -0.0240, -0.4925, -0.2996]])

In [396]:
s_user_f = user_f.clone()
s_user_f[:,stand_col]=stand_user_f
s_user_f

tensor([[ 1.0000,  0.9457,  0.3182,  ...,  0.0000,  0.0000,  0.0000],
        [ 1.0000, -0.2962, -0.4198,  ...,  0.0000,  0.0000,  0.0000],
        [ 1.0000,  0.3725, -0.0508,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.0000, -0.6783, -0.4198,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000, -0.6783, -0.4198,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000, -0.6783, -0.4198,  ...,  0.0000,  0.0000,  0.0000]])

In [397]:
torch.save(s_user_f,'Tensor/s_user_f.pt')

s_user_f将作为我们的模型的输入

s_user_f will be the input of our proposed model

# 6. 数据一览/Data overview

在这个部分，你可以查看该jupyter文件的相关的数据文件说明（下列名字为保存时的名字，不是变量名）

In this part, you can find descriptions of the data files involved in this Jupyter file (the names below are the file names used when saving, not the variable names)

1. Datasets (csv or xlsx file)
- Data_SinaWeibo.csv: 
  - The initial dataset obtained from the Python Script
- UserData.csv: Section 2.7
  - The initial user profile datasets with preprocessing (only users with profiles)
- UserProfile.csv: Section 2.7
  - The user profile datasets with necessary features (based on UserData; only users with profiles)
- ForCheck.csv: Section 3.3.2
  - The dataset with "@" in F_Comment
- ManuCode.xlsx: Section 3.5.1
  - The dataset used to manually decode emoji
- ManuCode_Finish.xlsx: Section 3.5.1
  - The dataset completed with manual decoding
- UserInteraction.csv: Section 4.1.1
  - The initial user interactions dataset (all users)
- UserFeature_Graph.csv: Section 4.1.4 (2)
  - The final user profile datasets (all users)
- UserInter_Graph.csv: Section 4.1.5 (4)
  - The user interaction dataset (all users; only includes username and user node index)
- Sina_allInfo.csv: Section 4.2.1
  - The dataset with all interaction information
- UserPost.csv: Section 4.2.3 (2)
  - The dataset of user-comment correspondences
  
2. Tensor (pt file)
- user_f: Section 4.1.4 (2)
  - User features for all users (obtained from UserFeature_Graph)
- user_embed_idx: Section 4.1.4 (4)
  - Users embedding indices (only users without profiles)
- userInter_data: Section 4.1.5 (6)
  - User interactions (identified by user node indices)
- userPost_comm: Section 4.2.3 (4)
  - User-comment correspondences (identified by user node indices and comment ones)
- Comment_Vector: Section 4.2.4 (2)
  - Text vector representation for all comments (obtained from bert-base-chinese)
- userInter_self: Section 5.1.1
  - User interactions with self-loop (identified by user node indices; obtained from userInter_data)
- HeteroData: Section 5.1.2
  - HeteroG
- s_user_f: Section 5.2
  - Standardization of numeric variables in user_f (all users; obtained from user_f)