# Functions and template

In [2]:
ZHIPUAI_API_KEY = "your zhipu ai api key"

In [1]:
import re
import json
from zhipuai import ZhipuAI
from langchain.prompts import PromptTemplate
import pandas as pd

In [3]:
def zhipu_chat_complete(prompt, model="glm-4-air"):
    """Send a single-turn prompt to a Zhipu GLM model and return the reply text."""
    # Initialize the Zhipu API client
    client = ZhipuAI(api_key=ZHIPUAI_API_KEY)
    # Call the chat completions endpoint with a single user message
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.1,
        max_tokens=1024
    )
    # Return the text of the first choice
    answer = response.choices[0].message.content
    return answer

In [4]:
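def extract_sentiments_json(text):
    """Return the first {...} span in text as a JSON string if it has all three sentiment keys, else None."""
    # Non-greedy pattern for a flat (non-nested) JSON object
    json_pattern = r'\{.*?\}'
    # Find all candidate JSON objects in the text
    json_matches = re.findall(json_pattern, text, re.DOTALL)
    # No braces at all: nothing to parse
    if not json_matches:
        return None
    # Parse the first candidate and check that it has the required keys
    try:
        json_data = json.loads(json_matches[0])
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        return None
    if ("positive" in json_data) and ("negative" in json_data) and ("neutral" in json_data):
        return json_matches[0]
    return None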
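
`extract_sentiments_json` only inspects the first brace-delimited span, so a reply whose reasoning text contains an earlier `{...}` fragment would be rejected even if a valid distribution follows. A minimal sketch of a more tolerant variant is shown below; the name `find_sentiment_json` and the constant `REQUIRED_KEYS` are hypothetical and not part of the original notebook, and the function returns the parsed dict rather than the raw string.

```python
import json
import re

# Keys the prompt asks the model to emit (taken from the output requirements).
REQUIRED_KEYS = {"positive", "negative", "neutral"}


def find_sentiment_json(text):
    """Return the first {...} span that parses as JSON and has all required keys.

    Hypothetical helper: unlike extract_sentiments_json, it keeps scanning
    after a candidate fails to parse, and it returns a dict (or None).
    """
    for candidate in re.findall(r"\{.*?\}", text, re.DOTALL):
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # not valid JSON; try the next candidate
        if isinstance(data, dict) and REQUIRED_KEYS.issubset(data):
            return data
    return None


reply = 'Notes {not json} Final: {"positive": 0.1, "negative": 0.8, "neutral": 0.1}'
result = find_sentiment_json(reply)
```

Like the original pattern, the non-greedy regex only handles flat objects; nested JSON would still need a real parser.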

In [39]:
prompt_template_string = """
Consider the message on a listing firm, please extract the sentiments (positive, negative, neutral) distribution (i.e., probabilities of its sentiments) from the message.
The sum of probabilities of sentiments must be equal to 1.

The message:
{message}

Workflow: 
- First step, please pay special attention to the following points one by one, and at the same time evaluate the sentiments distribution of the message.
Attention 1: Please pay special attention to any irrealis mood used.
Attention 2: Please pay special attention to any rhetorics (sarcasm, negative assertion, etc.) used.
Attention 3: Please focus on the speaker sentiment, not a third party.
Attention 4: Please focus on the stock ticker/tag/topic, not other entities.
Attention 5: Please pay special attention to the time expressions, prices, and other unsaid facts.
- Second step, Based on the initial evaluations in the first step, make the final overall evaluation of the sentiments (positive, negative, neutral) distribution.

Output requirements:
- Output the sentiments distribution in the second step strictly following JSON format with keys of positive, negative, and neutral.
- The sum of probabilities of sentiments must be equal to 1. 
"""

# test

In [60]:
prompt_template = PromptTemplate(template= prompt_template_string, input_variables=["message"])
message = "可怜的投资者往往还在期待像过去那样上天，有机会不逃跑可能就是上西天[微笑]，真正好的你发现不了，你拿到的往往就是地雷"
prompt = prompt_template.format(message = message)

In [61]:
print(prompt)


Consider the message on a listing firm, please extract the sentiments (positive, negative, neutral) distribution (i.e., probabilities of its sentiments) from the message.
The sum of probabilities of sentiments must be equal to 1.

The message:
可怜的投资者往往还在期待像过去那样上天，有机会不逃跑可能就是上西天[微笑]，真正好的你发现不了，你拿到的往往就是地雷

Workflow: 
- First step, please pay special attention to the following points one by one, and at the same time evaluate the sentiments distribution of the message.
Attention 1: Please pay special attention to any irrealis mood used.
Attention 2: Please pay special attention to any rhetorics (sarcasm, negative assertion, etc.) used.
Attention 3: Please focus on the speaker sentiment, not a third party.
Attention 4: Please focus on the stock ticker/tag/topic, not other entities.
Attention 5: Please pay special attention to the time expressions, prices, and other unsaid facts.
- Second step, Based on the initial evaluations in the first step, make the final overall evaluation of the sentiments (

In [62]:
p = zhipu_chat_complete(prompt=prompt, model = "GLM-4-Air")

In [63]:
print(p)

To analyze the given message and extract the sentiment distribution, let's break down the message according to the attention points provided:

Message: "可怜的投资者往往还在期待像过去那样上天，有机会不逃跑可能就是上西天[微笑]，真正好的你发现不了，你拿到的往往就是地雷"

Translation: "Poor investors often still expect to soar to the heavens as in the past; if they don't escape at the opportunity, it might be going to the other world [smile]. You can't find the truly good ones, what you often end up with is a landmine."

Attention 1: The irrealis mood is present in the phrase "if they don't escape at the opportunity, it might be going to the other world," which implies a hypothetical scenario with a negative outcome.

Attention 2: The use of "上西天" (going to the other world) with the smiley emoji "[微笑]" can be interpreted as sarcasm, indicating a negative sentiment. The phrase "what you often end up with is a landmine" is a negative assertion about the outcomes for investors.

Attention 3: The speaker seems to express a negative sentiment towar

In [46]:
p_d = json.loads(extract_sentiments_json(p))

In [47]:
p_d

{'positive': 0.1, 'negative': 0.8, 'neutral': 0.1}
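
The prompt asks for probabilities that sum to 1, but nothing in the notebook checks that the model complied. The sketch below shows one way to validate and, if needed, rescale a parsed distribution. The helper name `normalize_distribution` and the tolerance `tol` are assumptions introduced for illustration.

```python
def normalize_distribution(dist, keys=("positive", "negative", "neutral"), tol=1e-6):
    """Return a copy of dist restricted to keys, rescaled so the values sum to 1.

    Hypothetical helper: raises ValueError for negative values or an all-zero
    distribution, and leaves values untouched when the sum is already within tol of 1.
    """
    values = [float(dist[k]) for k in keys]
    if any(v < 0 for v in values):
        raise ValueError("probabilities must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("probabilities must not all be zero")
    if abs(total - 1.0) <= tol:
        return dict(zip(keys, values))
    # Rescale so the three probabilities sum to 1
    return {k: v / total for k, v in zip(keys, values)}


# A reply whose values sum to 1.2 is rescaled proportionally.
fixed = normalize_distribution({"positive": 0.2, "negative": 0.6, "neutral": 0.4})
```

Rescaling preserves the ranking of the three sentiments, so the label derived from the largest probability does not change.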

In [48]:
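def label_sentiment(message):
    """Classify a message as 1 (positive), -1 (negative), or 0 (neutral)."""
    prompt_template = PromptTemplate(template=prompt_template_string, input_variables=["message"])
    prompt = prompt_template.format(message=message)
    p = zhipu_chat_complete(prompt=prompt, model="GLM-4-Air")
    sentiment_json = extract_sentiments_json(p)
    if sentiment_json is None:
        # No usable JSON in the reply; let the caller decide how to handle it
        raise ValueError("no sentiment JSON found in model reply")
    sentiment_dict = json.loads(sentiment_json)
    # Find the largest probability
    max_sentiment = max(sentiment_dict.values())
    # Map the dominant sentiment to a label; ties resolve to positive, then negative
    if sentiment_dict['positive'] == max_sentiment:
        return 1
    elif sentiment_dict['negative'] == max_sentiment:
        return -1
    else:
        return 0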

In [49]:
label_sentiment(message)

-1
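
The mapping from distribution to label inside `label_sentiment` can be isolated as a pure function, which makes the tie-breaking rule explicit and testable without calling the API. This is a sketch only: `distribution_to_label`, `LABELS`, and the `priority` parameter are hypothetical names, and the default priority mirrors the `if`/`elif` order used in `label_sentiment`.

```python
# Numeric labels used in the dataset's `label` column.
LABELS = {"positive": 1, "negative": -1, "neutral": 0}


def distribution_to_label(dist, priority=("positive", "negative", "neutral")):
    """Return the label of the most probable sentiment.

    max() keeps the first maximal item it sees, so ties are resolved
    in the order given by `priority`.
    """
    best = max(priority, key=lambda k: dist[k])
    return LABELS[best]


label = distribution_to_label({"positive": 0.1, "negative": 0.8, "neutral": 0.1})
tied = distribution_to_label({"positive": 0.4, "negative": 0.4, "neutral": 0.2})
```

Passing a different `priority`, for example `("neutral", "negative", "positive")`, would make ties fall back to neutral instead.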

# data

In [14]:
# Read the CSV file
df = pd.read_csv('labeldata.csv')
# Show the first few rows
df.head()

Unnamed: 0,用户股吧年龄,用户昵称,个股名称,评论标题,评论内容,发帖时间,标签,label
0,5.8年,星辰大海12639,浦发银行(600000.sh),业绩不错，为啥跌的如此厉害亏大了,业绩不错，为啥跌的如此厉害[亏大了],2019/10/30 13:19,负向,-1
1,8.8年,高天运,浦发银行(600000.sh),愉快,愉快,2016/7/28 17:49,正向,1
2,6.9年,面包二米粥,浦发银行(600000.sh),浦发是庄股了！每天成交量好少！,浦发是庄股了！每天成交量好少！,2016/12/4 9:33,负向,-1
3,10.3年,股友O9700I7223,浦发银行(600000.sh),有没有人想过，过了3300，赚一笔就跑的,是不是和2015年6月份时候，认为大盘5100还要上攻，是同样的死法？ 历史总是惊人的相同[大笑],2017/4/19 10:30,中性,0
4,7年,股市扁鹊8956,浦发银行(600000.sh),开盘5分钟,1：股指期货：翻红上冲中，较好。 2：沪深两市：两市纷纷高开，主板和创业板暂时持平，盘中预计...,2016/2/16 9:53,中性,0


In [15]:
df.shape

(12000, 8)

In [75]:
sampled_data = df.sample(n=50)

In [76]:
sampled_data

Unnamed: 0,用户股吧年龄,用户昵称,个股名称,评论标题,评论内容,发帖时间,标签,label
3985,6.4年,匈奴女人,海尔智家(600690.sh),明天满上！,明天满上！,2018/5/27 21:37,正向,1
3053,3.9年,神人昶,复星医药(600196.sh),！这是要跌停卧槽！,！这是要跌停?卧槽！,2019/1/2 10:49,负向,-1
7275,6.7年,华尔街巨富,中国石油(601857.sh),关注万达电影，今天有好戏！开盘进入！,关注万达电影，今天有好戏！开盘进入！,2017/6/29 7:25,正向,1
7302,2.7年,真的土猩猩,中国石油(601857.sh),2点了，等待万手哥出现,2点了，等待万手哥出现,2019/9/23 14:01,中性,0
4021,7.3年,湘水稳涨,海尔智家(600690.sh),吵什么，这肯定利空呀总不会是利好吧，影响多大星期一收盘就知道了，不说了，,吵什么，这肯定利空呀总不会是利好吧，影响多大星期一收盘就知道了，[不说了]，,2018/5/27 5:10,中性,0
9065,6.8年,民间炒股者,兴业银行(601166.sh),后市冲高后仍有振荡,后市冲高后还有振荡,2016/6/28 17:15,正向,1
9153,12.7年,暗色调252277,兴业银行(601166.sh),福成股份：收购牧场剥离冗余陵园业务前景广阔,今晚消息汇总,2017/7/29 23:09,中性,0
1010,3.7年,爱德华爱投资,上汽集团(600104.sh),手握千亿未分配利润，这可能才是近期杀入的大户的决心吧！周期股就是得耐心的等。等到,手握千亿未分配利润，这可能才是近期杀入的大户的决心吧！周期股就是得耐心的等。等到经济起缓，世...,2019/5/27 14:29,正向,1
4635,7.8年,elliesa88,中国中免(601888.sh),涨两天跌半个月,涨两天跌半个月,2018/6/13 17:52,负向,-1
10802,3年,股友E9mu9l,三安光电(600703.sh),快点跌，别犹豫使劲砸。,快点跌，别犹豫使劲砸。,2019/7/4 10:16,负向,-1


In [77]:
predicted = []
labels = []

In [78]:
# Iterate over the sampled rows and label each message
import time
total_rows = sampled_data.shape[0]
count = 0
for index, row in sampled_data.iterrows():
    # Read the comment text from the 评论内容 column
    message = row['评论内容']
    # Run the labeling pipeline on the message
    print(f"message: {message}")
    #time.sleep(1)
    try:
        predicted_label = label_sentiment(message=message)
        print(f"predicted_label: {predicted_label}")
        predicted.append(predicted_label)
        labels.append(row["label"])
        print("actual label: " + str(row["label"]))

        # Print progress (only successfully labeled rows are counted)
        current_progress = (count + 1) / total_rows * 100
        count += 1
        print(f"Progress: {current_progress:.2f}%")
    except Exception:
        # Skip rows whose reply could not be parsed
        continue

message: 明天满上！
predicted_label: 1
actual label: 1
Progress: 2.00%
message: ！这是要跌停?卧槽！
predicted_label: -1
actual label: -1
Progress: 4.00%
message: 关注万达电影，今天有好戏！开盘进入！
predicted_label: 1
actual label: 1
Progress: 6.00%
message: 2点了，等待万手哥出现
predicted_label: 0
actual label: 0
Progress: 8.00%
message: 吵什么，这肯定利空呀总不会是利好吧，影响多大星期一收盘就知道了，[不说了]，
predicted_label: -1
actual label: 0
Progress: 10.00%
message: 后市冲高后还有振荡
predicted_label: -1
actual label: 1
Progress: 12.00%
message: 今晚消息汇总
predicted_label: 1
actual label: 0
Progress: 14.00%
message: 手握千亿未分配利润，这可能才是近期杀入的大户的决心吧！周期股就是得耐心的等。等到经济起缓，世界局势平稳，到时候肯定是一飞冲天啊！
predicted_label: 1
actual label: 1
Progress: 16.00%
message: 涨两天跌半个月
Error decoding JSON: Expecting property name enclosed in double quotes: line 2 column 21 (char 22)
message: 快点跌，别犹豫使劲砸。
predicted_label: -1
actual label: -1
Progress: 18.00%
message: 中信你等着包销邮储银行那些弃购，看看你年底业绩如何打脸
predicted_label: -1
actual label: -1
Progress: 20.00%
message: 下周还有动力
predicted_label: 0
actual label: 1

In [79]:
len(labels)

49

In [80]:
len(predicted)

49
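
Only 49 of the 50 sampled messages were labeled, which is consistent with the single `Error decoding JSON` line in the log above: a failed reply is skipped rather than retried. Since the model runs at a non-zero temperature, a second call may well return parsable JSON. The sketch below shows a generic retry wrapper; `call_with_retries` and its `attempts` and `delay` parameters are hypothetical and not part of the original notebook.

```python
import time


def call_with_retries(fn, *args, attempts=3, delay=1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying on any Exception up to `attempts` times.

    Hypothetical helper: sleeps `delay` seconds between attempts and
    re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error


# Stand-in for label_sentiment that fails once, then succeeds.
calls = {"n": 0}

def flaky_label(message):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ValueError("no sentiment JSON found in model reply")
    return -1

retried = call_with_retries(flaky_label, "demo message", attempts=3, delay=0)
```

In the loop above, the call would become something like `call_with_retries(label_sentiment, message=message)`, at the cost of extra API requests for stubborn messages.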

In [81]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

# labels holds the ground-truth labels; predicted holds the model predictions
predicted_labels = predicted

#labels = new_labels  # ground-truth labels
#predicted_labels = new_predicted  # predictions

# Accuracy
accuracy = accuracy_score(labels, predicted_labels)

# Precision, weighted by class support
precision = precision_score(labels, predicted_labels, average='weighted')

# Recall, weighted by class support
recall = recall_score(labels, predicted_labels, average='weighted')

# F1 score, weighted by class support
f1 = f1_score(labels, predicted_labels, average='weighted')

# Confusion matrix (rows: true labels, columns: predicted labels, sorted as -1, 0, 1)
conf_matrix = confusion_matrix(labels, predicted_labels)

# ROC AUC (requires predicted probabilities rather than hard labels)
# roc_auc = roc_auc_score(labels, predicted_labels, multi_class='ovr')

# Print the results
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix: {conf_matrix}")
# print(f"ROC AUC: {roc_auc}")

Accuracy: 0.6938775510204082
Precision: 0.6862244897959183
Recall: 0.6938775510204082
F1 Score: 0.6825188547179669
Confusion Matrix: [[17  2  0]
 [ 3  5  4]
 [ 4  2 12]]
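
The weighted scores can be reproduced from the confusion matrix alone, which also shows where the errors concentrate. The sketch below recomputes them in plain Python, assuming the scikit-learn convention that rows are true labels and columns are predictions, in sorted label order (-1, 0, 1). The function name `weighted_metrics` is hypothetical.

```python
def weighted_metrics(conf_matrix):
    """Return (accuracy, precision, recall, f1) with support-weighted averaging."""
    n = len(conf_matrix)
    total = sum(sum(row) for row in conf_matrix)
    accuracy = sum(conf_matrix[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        tp = conf_matrix[i][i]
        support = sum(conf_matrix[i])                           # true count of class i
        predicted = sum(conf_matrix[r][i] for r in range(n))    # predicted count of class i
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Confusion matrix reported above (labels -1, 0, 1).
cm = [[17, 2, 0],
      [3, 5, 4],
      [4, 2, 12]]
acc, prec, rec, f1_weighted = weighted_metrics(cm)
```

Per class, recall is 17/19 for negative, 5/12 for neutral, and 12/18 for positive, so on this sample the neutral class accounts for most of the lost accuracy.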
