# Data Loading

- The data can be scraped from https://guba.eastmoney.com/list,601901_1.html
- The data used in this project is not current; it consists of comments from around December 2023

In [34]:
import pandas as pd
comment = pd.read_excel('D:/model/web/nlp04/data/股民评论数据.xlsx')

In [35]:
comment.head()

Unnamed: 0,title,time,read,reply,name_text
0,“人齐了”再强化！方正研究所再添副所长 原德邦证券芦哲加盟出任 身兼首席宏观经济学家,2023-12-01 12:15:00,2020,14,方正证券资讯
1,看走势方正本周末不会停牌，其实晚点停更好，因停牌前20日涨幅不能超过20%，往后,2023-12-01 03:19:00,231,3,越是真理越简单
2,图一是日线图，绿圈位置是日线死叉一个日线级别卖点。图二是一个30分图。最低杀到8,2023-11-30 05:25:00,7665,396,股怪不在怪
3,大家散了吧，把重组什么的都忘了了吧，周一历史高位快跑吧。,2023-12-01 03:16:00,49,1,Aishad
4,注销东方财富，换成通达信，这里叽叽喳喳太恶心了,2023-12-01 03:32:00,12,0,微笑若成风
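
As a quick sanity check on the note that the comments date from around December 2023, the `time` column can be parsed and its range inspected. The following is a minimal sketch using three rows copied from the preview above; the `sample` frame is a stand-in for illustration, and in the notebook the same calls would be applied to `comment`.

```python
import pandas as pd

# Three rows copied from the preview (rows 2, 3 and 4); `sample` is a
# hypothetical stand-in for the full `comment` DataFrame.
sample = pd.DataFrame({
    'time': ['2023-11-30 05:25:00', '2023-12-01 03:16:00', '2023-12-01 03:32:00'],
    'read': [7665, 49, 12],
    'reply': [396, 1, 0],
})

# Parse timestamps; this is harmless if the column is already datetime
sample['time'] = pd.to_datetime(sample['time'])

# Earliest and latest comment in the sample
earliest, latest = sample['time'].min(), sample['time'].max()
print(earliest, latest)

# Most-read comments first
most_read = sample.sort_values('read', ascending=False)
print(most_read)
```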


# Data Preprocessing

In [44]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [39]:
model_path = 'D:/model/web/nlp04/new_model'

In [42]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [43]:
tokenizer 

BertTokenizerFast(name_or_path='D:/model/web/nlp04/new_model', vocab_size=21128, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [40]:
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels = 3)

In [41]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

# Quick Sentiment Classification with pipeline

In [57]:
from transformers import pipeline

In [58]:
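classifier = pipeline('sentiment-analysis',
                      model=model,
                      tokenizer=tokenizer,
                      max_length=512,   # matches the tokenizer's model_max_length
                      truncation=True,  # cut off inputs longer than max_length
                      padding=True,
                      device='cuda'     # requires a CUDA-capable GPU; use 'cpu' otherwise
                     )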

In [59]:
classifier(list(comment['title'])[0])

[{'label': 'Neutral', 'score': 0.9998457431793213}]
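
The `score` in this output is a probability, not a raw model output. Assuming the pipeline's default post-processing for a single-label model with three classes, the logits are passed through a softmax and the highest-probability class is returned. The sketch below reproduces that step in plain Python; the logits and the `id2label` order are made up for illustration (the real mapping lives in `model.config.id2label`).

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label order; check model.config.id2label for the real one
id2label = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}

def top_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {'label': id2label[best], 'score': probs[best]}

# Made-up logits in which the second class dominates
result = top_label([-3.0, 6.0, -2.5])
print(result)
```

A score close to 1 therefore only says that one logit is much larger than the other two; it is not by itself evidence that the prediction is correct.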

In [68]:
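def result_output(text=None, classifier=classifier):
    """Classify each comment and return a DataFrame of comment, label and score."""
    # Resolve the default at call time rather than at definition time
    if text is None:
        text = list(comment['title'])

    # Run the classifier once over the list instead of twice per comment
    results = classifier(text)

    df = pd.DataFrame({
        'comment': text,
        'label': [r['label'] for r in results],
        'score': [r['score'] for r in results],
    })
    # Map the model's English labels to Chinese:
    # 中性 = neutral, 消极 = negative, 积极 = positive
    df['label'] = df['label'].map({
        'Neutral': '中性',
        'Negative': '消极',
        'Positive': '积极'
    })
    return df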

In [69]:
result_output(list(comment['title'])[:10], classifier)

Unnamed: 0,comment,label,score
0,“人齐了”再强化！方正研究所再添副所长 原德邦证券芦哲加盟出任 身兼首席宏观经济学家,中性,0.999846
1,看走势方正本周末不会停牌，其实晚点停更好，因停牌前20日涨幅不能超过20%，往后,中性,0.999514
2,图一是日线图，绿圈位置是日线死叉一个日线级别卖点。图二是一个30分图。最低杀到8,中性,0.999524
3,大家散了吧，把重组什么的都忘了了吧，周一历史高位快跑吧。,中性,0.999843
4,注销东方财富，换成通达信，这里叽叽喳喳太恶心了,消极,0.999705
5,周一看涨跌,消极,0.874752
6,涨个几分钱可真费劲啊[滴汗][滴汗][滴汗],积极,0.995598
7,又套一批，周一快跑吧,中性,0.999845
8,周线像是吊颈线，今天把本周补的仓位全部T出去了，等待下周验证,中性,0.99978
9,感觉信达还在减持,中性,0.998025
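
With the results in a DataFrame, a natural next step is to summarize the label distribution and flag predictions the model is less sure about. The sketch below uses the labels and scores copied from the ten rows above (中性 = neutral, 消极 = negative, 积极 = positive); the 0.9 cutoff is an arbitrary value chosen for illustration, not something taken from the notebook.

```python
import pandas as pd

# Labels and scores copied from the ten-row result table above
df = pd.DataFrame({
    'label': ['中性', '中性', '中性', '中性', '消极',
              '消极', '积极', '中性', '中性', '中性'],
    'score': [0.999846, 0.999514, 0.999524, 0.999843, 0.999705,
              0.874752, 0.995598, 0.999845, 0.99978, 0.998025],
})

# How many comments fall into each sentiment class
label_counts = df['label'].value_counts()
print(label_counts)

# Hypothetical confidence cutoff for manual review
THRESHOLD = 0.9
low_confidence = df[df['score'] < THRESHOLD]
print(low_confidence)
```

In this sample, only row 5 ("周一看涨跌") falls below the cutoff, and some high-confidence labels, such as row 6, look questionable on reading, so spot-checking predictions by hand remains worthwhile.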
