## 项目题目：基于特征工程的微博事件热度预测

### 摘要
    对抓取的微博事件数据从信息传播、发帖用户、博文内容三方面提取特征，利用多元回归简单分析影响微博热度的因素，利用随机森林算法，训练模型，预测微博事件热度。

### 引言

#### 问题提出
      随着Facebook、微博等各大社交平台的广泛应用，以文本、图片、视频的形式发布高时效性、短文本内容已经成为信息传播的重要方式。伴随信息量爆炸性增长、用户高活跃度而来的是信息过载、虚假信息泛滥、舆情管控滞后等问题。对于以上问题解决的需求及对信息热度的关注，使得社交平台中信息热度的预测成为部分学者较为关注的点。
      热度是指信息发布后受大众欢迎的程度，经一段时间传播后所呈现的数值表征，研究人员通常倾向于关注“转发量”、“评论量”和“点赞量”，结合“传播深度”、发布的“博文条数”和时间等相关因素实现对热度的计算与预测[1]。比如，陈梦秋[2] 等人先利用网页排名（PageRank, PR）算法计算出博主的用户影响力，将博主用户影响力与博主最近微博热度、博文是否原创、是否含有标签等作为特征，对博文热度进行预测[2]。谭炎从微博话题传播动态特征、用户影响力、话题内容、微博情感方面构造特征，利用分类模型预测微博话题流行趋势[3]。H Zhu等在热度预测中提出传播加速度的新特征，结合传播加速度、信息发布初期的热度和用户活跃度完成特征模型的构建[4]。

#### 研究思路
     对数据了解并导入⮕数据处理（去除缺失值，去重等）⮕从博文传播特征（转发、点赞、评论），发帖用户特征（粉丝数、关注数、地域），博文内容特征（情感，长度，@和#数量，原创性）三方面构建特征⮕多元回归，分析影响因素⮕训练模型（随机森林）预测事件热度⮕评估模型

#### 假设或问题：
    1.博文的传播力越大，事件热度越高。
    2.发帖用户的粉丝数、关注数越多，事件热度越高。
    3.博文内容的@和#数量越多，非原创率越高，事件热度越高。
    4.博文的积极情感占比对事件热度产生正向影响。
    5.博文越短，事件热度越高。
    6.不同地域的人是否有主题偏好？

### 研究方法
    线性回归、随机森林、机器学习

In [1]:
import os
import urllib3
import json
import warnings
import ast
import re
import pandas as pd
import pandas as pdchangjiantou
from snownlp import SnowNLP
from tqdm.notebook import tqdm
from multiprocessing import Pool
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [2]:
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [3]:
# 导入数据与初步数据处理
class HandleData:
    def __init__(self,folder_path):
        self.folder_path = folder_path
        pass
    
    def get_file_paths(self):
        file_paths = []
        for filename in os.listdir(self.folder_path):
            file_path = os.path.join(self.folder_path,filename)
            if os.path.exists(file_path):
                file_paths.append(file_path)
        return file_paths

    def handle_excel_data(self,file_paths):
        data_frames = {}
        for item in file_paths:
            try:
                df = pd.read_excel(item, nrows=100)
                df.dropna(subset=['全文内容'], inplace=True)
                df.drop_duplicates(subset=['全文内容'], inplace=True)
                get_name = os.path.basename(item)    
                data_frames[get_name] = df
            except Exception as e:
                print(f"Error processing file: {item}")
                print(f"Error message: {e}")
                get_name = os.path.basename(item)   
                with open(r'C:\Users\lenovo\Desktop\需要重新保存的excel文件.txt','a') as file:
                    file.write( get_name + "\n")
                continue
        return data_frames

    def main(self):
        file_paths = self.get_file_paths()
        data_frames = self.handle_excel_data(file_paths)
        return data_frames
    
folder_path =r"C:\Users\lenovo\Desktop\微博热度预测\选题二训练集：热点事件的发展趋势预测\数据"     
hander = HandleData(folder_path)
data_frames = hander.main()

In [4]:
len(data_frames)

3920

In [5]:
df = data_frames["1.xlsx"]
df[:5]

Unnamed: 0,检索ID,标题/微博内容,全文内容,点赞,转发,评论,账号昵称UID加密,粉丝数,关注数,地域
0,1,//@X玖少年团肖战DAYTOY:期待冬奥赛场上的那抹中国红！奥运健儿，加油！#今日立春冬奥开幕#,//@X玖少年团肖战DAYTOY:期待冬奥赛场上的那抹中国红！奥运健儿，加油！#今日立春冬奥...,0,0,0,69f298cda961436a19a55ade124c9a14,1373,1571,广东
1,1,恭喜@小呆子918 1名用户获得【测试一下】。C官方唯一抽奖工具@C抽奖平台 对本次抽奖进行...,恭喜@小呆子918 1名用户获得【测试一下】。C官方唯一抽奖工具@C抽奖平台 对本次抽奖进行...,0,0,0,7a37d22725d2a807d0515ad17f4c59d1,774,1293,北京
2,1,转发C,转发C【原C】【#冬奥来啦#—#时代少年团最美中国画#】 “喝中国茶，看中国画”由@时代少年...,0,0,0,cb45856f3c39c6e6c574f61575575913,0,109,浙江
3,1,//@我应该是没机会吧_:一起为奥运健儿加油[赢牛奶]//@TNT时代少年团后援会官博:#时...,//@我应该是没机会吧_:一起为奥运健儿加油[赢牛奶]//@TNT时代少年团后援会官博:#时...,0,0,0,07673de0c35100bec7cd94233de7cc16,33,136,香港
4,1,[开学季]//@-小丸不能掉队-://@-愛意寄予日落:😡//@-愛意寄予日落://@文轩星...,[开学季]//@-小丸不能掉队-://@-愛意寄予日落:😡//@-愛意寄予日落://@文轩星...,0,0,0,a3fabba3c22581c93ae75712dace4efa,641,36,江苏


In [6]:
# 导入并处理《数据集序号和事件对应表》
data_frames_key = list(data_frames.keys())
filter_list = []
for item in data_frames_key:
    try:
        i = int(item.split('.')[0])
        filter_list.append(i)
    except Exception as e:
        print(f"Error processing file: {item}")
        print(f"Error message: {e}")

orderandthing = pd.read_excel(r"C:\Users\lenovo\Desktop\微博热度预测\选题二训练集：热点事件的发展趋势预测\数据集序号和事件对应表.xlsx")
orderandthing = orderandthing[orderandthing['序号'].isin(filter_list)]
orderandthing[:5]

Unnamed: 0,序号,事件,热度,起始时间
0,1,北京举办冬奥会冬残奥会,62.84,2022-01-26
1,2,2022年全国高考,54.37,2022-05-14
2,3,国内成品油价格已进行22轮调整,46.03,2022-01-01
3,4,中国空间站建造任务稳步推进,42.78,2022-01-01
4,5,2022年春晚,32.03,2022-01-01


### 1.博文传播特征

In [7]:
# 博文传播力：单条博文传播力 = 点赞数 + 转发数 +评论数   事件的博文传播力 = 该事件下全部单条博文传播力的均值
order_list = orderandthing['序号'].tolist()
average_communications = []
for order in order_list:
    DataFrame_name = f"{order}.xlsx"
    df = data_frames[DataFrame_name ]
    total_communication = df[['点赞', '转发', '评论']].fillna(0).sum(axis=1)
    average_communication = total_communication.mean()
    average_communications.append(average_communication)
orderandthing['博文传播力'] = average_communications

### 2.发帖用户特征

In [8]:
# 粉丝数、关注数 = 该事件全部博文发帖用户粉丝数、关注数的均值。
order_list = orderandthing['序号'].tolist()
average_fans = []
average_follows = []
for order in order_list:
    DataFrame_name = f"{order}.xlsx"
    df = data_frames[DataFrame_name ]
    df["粉丝数"] = pd.to_numeric(df["粉丝数"], errors='coerce').fillna(0).astype(int)
    df["关注数"] = pd.to_numeric(df["关注数"], errors='coerce').fillna(0).astype(int)
    average_fan =  df["粉丝数"] .mean()
    average_follow =  df["关注数"].mean()
    average_fans.append(average_fan)
    average_follows.append(average_follow)
orderandthing['粉丝数'] = average_fans
orderandthing['关注数'] = average_follows

In [9]:
# 用户地域分布，统计事件下用户的微博ip,省、直辖市、特别行政区和海外出现的频次。
order_list = orderandthing['序号'].tolist()
with open(r"C:\Users\lenovo\Desktop\微博热度预测\中国省份.json", "r", encoding="utf-8") as file:
    provinces = json.load(file)
for province in provinces:
    counts = []
    for order in order_list:
        DataFrame_name = f"{order}.xlsx"
        df = data_frames[DataFrame_name ]
        region_counts = df['地域'].value_counts()
        count = region_counts.get(province, 0)
        counts.append(count)
    orderandthing[province] = counts

### 3.博文内容特征

In [10]:
# @数、#数 = 该事件全部微博的'标题/微博内容'中@、#数量的均值。
# 非原创率 = 该事件全部微博的'标题/微博内容'列中含有“转发”的博文占总博文的比例。
order_list = orderandthing['序号'].tolist()
mention_counts = []
topic_counts = []
no_originals = []
for order in order_list:
    DataFrame_name = f"{order}.xlsx"
    df = data_frames[DataFrame_name ]
    df.dropna(subset=['标题/微博内容'], inplace=True)
    df['标题/微博内容'] = df['标题/微博内容'].astype(str)
    average_topic_count = df['标题/微博内容'].apply(lambda x: x.count('#')).mean()
    average_mention_count = df['标题/微博内容'].apply(lambda x: x.count('@')).mean()
    # 非原创率
    average_no_original = df['标题/微博内容'].str.contains('转发').sum() /len(df)
    mention_counts.append(average_mention_count)
    topic_counts.append(average_topic_count)
    no_originals.append(average_no_original)
    
orderandthing['@数'] = mention_counts
orderandthing['#数'] = mention_counts
orderandthing['非原创率'] = no_originals

In [11]:
# 平均博文长度 = 该事件全部微博“全文内容”的长度的均值。
# 长博文比率 = 该事件全部微博“全文内容”大于等于140个字符的博文数量占总博文数量的比例。
# 长博文比率 = 该事件全部微博“全文内容”小于140个字符的博文数量占总博文数量的比例。
# 长博文/短博文 = 该事件全部微博，长博文的数量 +1 / 短博文的数量 +1
order_list = orderandthing['序号'].tolist()
average_lengths = []
long_ratios = []
short_ratios = []
long_short_ratios = []
for order in tqdm(order_list, desc="Processing orders", total=len(order_list)):
    DataFrame_name = f"{order}.xlsx"
    df = data_frames[DataFrame_name ]
    df.dropna(subset=['标题/微博内容'], inplace=True)
    df['标题/微博内容'] = df['标题/微博内容'].astype(str)
    
    lengths = []
    long_count = 0
    short_count = 0
    for text in df['标题/微博内容']:
        #text = re.sub(r'(?:回复)?(?://)?@[\w\u2E80-\u9FFF]+:?|\[\w+\]', ',',text)
        text_length = len(text)
        lengths.append(text_length)
        if text_length >= 140:
            long_count += 1
        elif text_length < 140:
            short_count += 1

    average_length = sum(lengths) / len(lengths)
    average_lengths.append(average_length)
    
    long_ratio = long_count / len(df)
    short_ratio = short_count / len(df)
    
    long_short_ratio = (long_count + 1) / (short_count + 1)
    
    long_ratios.append(long_ratio)
    short_ratios.append(short_ratio )
    long_short_ratios.append(long_short_ratio)
    
orderandthing['平均博文长度'] = average_lengths
orderandthing['长博文比率'] = long_ratios 
orderandthing['短博文比率'] = short_ratios
orderandthing['长博文/短博文'] = long_short_ratios

Processing orders:   0%|          | 0/3917 [00:00<?, ?it/s]

In [None]:
# 用SnowNLP计算情感值。
# 平均情感值 = 该事件全部微博的“标题/微博内容”文本情感值的均值。
# 消极情感比率 = 该事件全部微博的“标题/微博内容”文本情感值 <0.5 的博文数量占总博文数量的比例。
# 积极情感比率 = 该事件全部微博的“标题/微博内容”文本情感值 >=0.5 的博文数量占总博文数量的比例。
# 积极/消极 = 该事件全部微博的（积极情感的博文数量 +1）/ （消极情感的博文数量 +1）
order_list = orderandthing['序号'].tolist()
dict_sentiments = {}
for order in tqdm(order_list, desc="Processing orders", total=len(order_list)):
    DataFrame_name = f"{order}.xlsx"
    df = data_frames[DataFrame_name ]
    df.dropna(subset=['标题/微博内容'], inplace=True)
    df['标题/微博内容'] = df['标题/微博内容'].astype(str)
    sentiment = []
    for text in df['标题/微博内容']:
        text = re.sub(r'(?:回复)?(?://)?@[\w\u2E80-\u9FFF]+:?|\[\w+\]', ',',text)
        s = SnowNLP(text)
        sentiment.append(s.sentiments)
    dict_sentiments[ DataFrame_name] = sentiment 
       
data= [{'文件名': key, '情感值': value} for key, value in dict_sentiments.items()]
df_sentiments = pd.DataFrame(data)
output_file = r"C:\Users\lenovo\Desktop\微博热度预测\sentiments.xlsx"
df_sentiments.to_excel(output_file, index=False)

In [12]:
df_sentiments = pd.read_excel(r"C:\Users\lenovo\Desktop\微博热度预测\sentiments.xlsx")
df_sentiments['情感值'] = df_sentiments['情感值'].apply(ast.literal_eval)

df_sentiments['平均情感值'] = df_sentiments['情感值'].apply(lambda x: sum(x) / len(x))
df_sentiments['积极情感计数'] = df_sentiments['情感值'].apply(lambda x: sum(1 for emotion in x if emotion >= 0.5))
df_sentiments['消极情感计数'] = df_sentiments['情感值'].apply(lambda x: sum(1 for emotion in x if emotion < 0.5))

df_sentiments['消极情感比率'] = df_sentiments['消极情感计数'] / len(df_sentiments['情感值'])
df_sentiments['积极情感比率'] = df_sentiments['积极情感计数'] / len(df_sentiments['情感值'])
df_sentiments['积极/消极']  = (df_sentiments['积极情感计数'] + 1) / (df_sentiments['消极情感计数'] + 1)

# 将df_sentiments中的列合并到orderandthing数据框中。
df_sentiments['序号'] = df_sentiments['文件名'].apply(lambda x: int(x.replace('.xlsx', '')))
orderandthing = orderandthing.merge(df_sentiments[['序号', '平均情感值', '消极情感比率', '积极情感比率','积极/消极']], on='序号', how='left')

In [13]:
orderandthing[:5]

Unnamed: 0,序号,事件,热度,起始时间,博文传播力,粉丝数,关注数,北京,天津,河北,...,#数,非原创率,平均博文长度,长博文比率,短博文比率,长博文/短博文,平均情感值,消极情感比率,积极情感比率,积极/消极
0,1,北京举办冬奥会冬残奥会,62.84,2022-01-26,0.05,5782.7,299.27,15,1,6,...,1.37,0.18,78.81,0.05,0.95,0.0625,0.685451,0.007659,0.017871,2.290323
1,2,2022年全国高考,54.37,2022-05-14,0.27,3470.29,463.81,9,0,4,...,0.6,0.46,40.13,0.08,0.92,0.096774,0.465057,0.014807,0.010722,0.728814
2,3,国内成品油价格已进行22轮调整,46.03,2022-01-01,0.21,14637.5,661.69,15,2,2,...,0.48,0.48,58.88,0.08,0.92,0.096774,0.389349,0.016594,0.008935,0.545455
3,4,中国空间站建造任务稳步推进,42.78,2022-01-01,0.06,12071.62,783.23,12,2,0,...,0.4,0.56,30.3,0.02,0.98,0.030303,0.371038,0.01736,0.00817,0.478261
4,5,2022年春晚,32.03,2022-01-01,1.08,25232.69,463.03,22,2,2,...,1.04,0.21,61.45,0.16,0.84,0.2,0.392202,0.015828,0.009701,0.619048


###  多元回归

In [14]:
X = orderandthing[['博文传播力', '粉丝数', '关注数', '@数', '#数', '非原创率', '平均博文长度', '长博文比率', '短博文比率', '长博文/短博文',
                  '平均情感值','消极情感比率','积极情感比率','积极/消极']]
y = orderandthing['热度']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()

0,1,2,3
Dep. Variable:,热度,R-squared:,0.033
Model:,OLS,Adj. R-squared:,0.03
Method:,Least Squares,F-statistic:,11.23
Date:,"Fri, 09 Feb 2024",Prob (F-statistic):,1.83e-22
Time:,20:20:27,Log-Likelihood:,-8308.6
No. Observations:,3917,AIC:,16640.0
Df Residuals:,3904,BIC:,16720.0
Df Model:,12,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.2994,0.443,-0.676,0.499,-1.168,0.569
博文传播力,0.0026,0.010,0.267,0.789,-0.016,0.022
粉丝数,-5.793e-07,3.11e-07,-1.862,0.063,-1.19e-06,3.07e-08
关注数,-0.0007,0.000,-5.396,0.000,-0.001,-0.000
@数,0.1087,0.034,3.163,0.002,0.041,0.176
#数,0.1087,0.034,3.163,0.002,0.041,0.176
非原创率,2.4966,0.351,7.117,0.000,1.809,3.184
平均博文长度,9.05e-05,8.86e-05,1.021,0.307,-8.32e-05,0.000
长博文比率,0.0828,0.317,0.261,0.794,-0.538,0.704

0,1,2,3
Omnibus:,8552.498,Durbin-Watson:,0.068
Prob(Omnibus):,0.0,Jarque-Bera (JB):,38595782.669
Skew:,19.629,Prob(JB):,0.0
Kurtosis:,487.707,Cond. No.,8.44e+23


### 预测热度（随机森林）

In [15]:
X = orderandthing[['博文传播力', '粉丝数', '关注数', '@数', '#数', '非原创率', '平均博文长度', '长博文比率', '短博文比率', '长博文/短博文',
                  '平均情感值','消极情感比率','积极情感比率','积极/消极']]
y = orderandthing['热度']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=300, random_state=42)  
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)

Mean Absolute Error: 0.5827045068027211
Root Mean Squared Error: 1.3733645400706505


### 研究发现
    1.博文传播力与事件热度不相关。
    2.发帖用户的粉丝数、关注数对事件热度产生负向影响。推翻假设2。
    3.博文内容中含有@、#数量越多，非原创率越高，事件热度越高。论证假设3。
    4.博文的消极情感比率与事件热度不相关，积极情感比率越高，事件热度越高。论证假设4。
    5.博文的长度特征和事件热度没有显著的相关关系。

### 结论
      本研究对微博事件数据进行了特征构建，利用线性回归简单分析了影响事件热度的因素，并利用随机森林算法，训练预测微博事件热度模型，并对性能做了评估。总体来看较为完善，但是存在诸多不足之处。
      1.因为在利用SnowNLP算文本情感值时，速度过慢，时间原因，只读取了数据中excel表的前100个，总数据量只有40万左右。根据测试，如果数据规模扩大，多元回归的显著性结果会发生变化，预测模型的偏差也会减少。另外，没有对SnowNLP进行微博文本的专门训练，使用的是开发者训练好的，在运用到微博文本时，可能计算的情感值会有偏差。
      2.特征提取的方法、分析影响因素的方法等不够严谨。因数据量等原因，训练的预测模型的性能不太高。
      3.最后一个问题，不同地域的人是否对微博事件有主题偏好？，暂时没想清楚怎么做，只简单统计了一下用户地域频次。