# Direct Matching Method (Brute Force)

In [1]:
import pandas as pd
import re
import json
from concurrent.futures import ThreadPoolExecutor
import time


## Trial run on a small dataset of 1,000 entries (regardless of bankruptcy status)

This cell runs the direct matching method on a 1,000-entry sample of the news data (`News_1000.xlsx`), which includes news about both solvent and bankrupt companies. It strips `S*ST`/`SST` markers from the titles, builds a single case-insensitive regular expression from the short and full company names in `A_share_list.json`, and records every name found in each news item's content in a new `Explicit_Company` column. Company names are used exactly as listed, irrespective of financial status, and news items that mention no listed company are dropped.

In [30]:
import pandas as pd
import json
import re

# Load news data
news_data = pd.read_excel('News_1000.xlsx')  # Load Excel file
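# Assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame
news_data['NewsContent'] = news_data['NewsContent'].fillna('')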

# Ensure the Title column is of string type
news_data['Title'] = news_data['Title'].astype(str)

# Clean the Title column by removing S*ST and SST markers
news_data['Title'] = news_data['Title'].apply(lambda x: re.sub(r'S\*?ST', '', x))

# Load the list of companies
with open('A_share_list.json', 'r', encoding='utf-8') as file:
    a_share_list = json.load(file)

# Create a list of company names
name_list = [item['name'] for item in a_share_list] + [item['fullname'] for item in a_share_list]

# Compile regular expression
pattern = re.compile('|'.join(map(re.escape, name_list)), flags=re.IGNORECASE)

# Extract relevant company names
news_data['Explicit_Company'] = news_data['NewsContent'].apply(
    # Deduplicate the matches and sort them so the output order is stable across runs
    lambda content: ', '.join(sorted(set(pattern.findall(content))))
)

# Filter out entries with empty Explicit_Company
filtered_news_data = news_data[news_data['Explicit_Company'] != '']

# Retain only NewsID, NewsContent, and Explicit_Company columns
result_news_data = filtered_news_data[['NewsID', 'NewsContent', 'Explicit_Company']]

# Save the processed data
# result_news_data.to_excel('result_news_data.xlsx', index=False)


## Run on the complete dataset (Considering the bankrupt companies)

This part runs the same direct matching on the complete news dataset (`News_original.xlsx`) to find explicit mentions of A-share companies. Unlike the trial run, it also strips the `S*ST`/`SST` markers from the company names in the list before matching, which is how this run takes the bankrupt companies into account. It then compiles the cleaned short and full names into one regular expression, extracts the names found in each news item's content, drops items with no match, and keeps the `NewsID`, `NewsContent`, and `Explicit_Company` columns for further analysis or export.
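The matching step can be illustrated on toy data. The sketch below assumes a two-entry company list and two invented news strings (`toy_companies`, `toy_news`, and `explicit_companies` are hypothetical names, not taken from `A_share_list.json` or the news files); it only shows how marker cleaning, the alternation pattern, and the empty-match filter fit together.

```python
import re

# Hypothetical toy data; the notebook reads the real data from
# A_share_list.json and the news Excel file.
toy_companies = [
    {"name": "Alpha", "fullname": "Alpha Holdings Co., Ltd."},
    {"name": "S*STBeta", "fullname": "Beta Steel Co., Ltd."},
]
toy_news = [
    "alpha announced a buyback while Beta cut output.",
    "No listed company is mentioned here.",
]

# Strip the S*ST / SST marker so the bare name can match in news text.
marker = re.compile(r"S\*?ST")
names = []
for item in toy_companies:
    for key in ("name", "fullname"):
        cleaned = marker.sub("", item[key]).strip()
        if cleaned:  # skip names that become empty after cleaning
            names.append(cleaned)

# One alternation of escaped literals, matched case-insensitively.
pattern = re.compile("|".join(map(re.escape, names)), flags=re.IGNORECASE)


def explicit_companies(text):
    """Return the distinct matched names, sorted for a stable order."""
    return ", ".join(sorted(set(pattern.findall(text))))


matches = [explicit_companies(text) for text in toy_news]
kept = [m for m in matches if m]  # drop items with no match
print(matches)
print(kept)
```

Note that `findall` returns the text as it appears in the news item, so a case-insensitive hit on `alpha` is reported in lowercase rather than in the spelling used in the company list.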

In [36]:
import pandas as pd
import json
import re

# Load news data
news_data = pd.read_excel('News_original.xlsx')  # Load Excel file
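# Assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame
news_data['NewsContent'] = news_data['NewsContent'].fillna('')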

# Ensure the Title column is of string type
news_data['Title'] = news_data['Title'].astype(str)

# Clean the Title column by removing S*ST and SST markers
news_data['Title'] = news_data['Title'].apply(lambda x: re.sub(r'S\*?ST', '', x))

# Load the list of companies
with open('A_share_list.json', 'r', encoding='utf-8') as file:
    a_share_list = json.load(file)

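# Clean company names in the list by removing the S*ST and SST markers
# (this pattern does not match a plain ST or *ST prefix).
# Replace with '' rather than ' ': a leftover space would become part of the name to match.
for item in a_share_list:
    item['name'] = re.sub(r'S\*?ST', '', item['name']).strip()
    item['fullname'] = re.sub(r'S\*?ST', '', item['fullname']).strip()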

# Create a list of company names
name_list = [item['name'] for item in a_share_list] + [item['fullname'] for item in a_share_list]

# Compile regular expression
pattern = re.compile('|'.join(map(re.escape, name_list)), flags=re.IGNORECASE)

# Extract relevant company names
news_data['Explicit_Company'] = news_data['NewsContent'].apply(
    # Deduplicate the matches and sort them so the output order is stable across runs
    lambda content: ', '.join(sorted(set(pattern.findall(content))))
)

# Filter out entries with empty Explicit_Company
filtered_news_data = news_data[news_data['Explicit_Company'] != '']

# Retain only NewsID, NewsContent, and Explicit_Company columns
result_news_data = filtered_news_data[['NewsID', 'NewsContent', 'Explicit_Company']]

# Save the processed data
# result_news_data.to_excel('result_news_data.xlsx', index=False)
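One property of the pattern above is worth keeping in mind: Python's `re` tries the alternatives of `a|b` from left to right and takes the first one that matches, not the longest. Because `name_list` places short names before full names, a short name that is a prefix of its own full name will always win at that position. The sketch below uses invented names to show the effect; sorting the names longest first is one possible remedy and is an assumption here, not something the notebook does.

```python
import re

# Hypothetical names: the short name is a prefix of the full name.
names = ["Gamma Bank", "Gamma Bank Co., Ltd."]
text = "Gamma Bank Co., Ltd. released its annual report."


def build_pattern(name_list):
    """Join escaped literal names into one case-insensitive alternation."""
    return re.compile("|".join(map(re.escape, name_list)), flags=re.IGNORECASE)


# Listed order: the short name matches first, so the full name is never reported.
as_listed = build_pattern(names).findall(text)

# Longest first: the full name gets the first chance to match.
longest_first = build_pattern(sorted(names, key=len, reverse=True)).findall(text)

print(as_listed)
print(longest_first)
```

Whether this matters depends on the downstream use: if short and full names are later mapped to the same company, either match identifies it.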



In [37]:
result_news_data

Unnamed: 0,NewsID,NewsContent,Explicit_Company
0,1,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣...,"建设银行, 中国建设银行股份有限公司"
1,2,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上...,农业银行
2,3,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段...,"中国国航, 外运发展"
3,4,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿...,胜利股份
7,8,由于全球最大的俄罗斯Uralkaly钾矿被淹，产量大减，同时满洲里口岸铁路在修复线，导致...,冠农股份
...,...,...,...
1037030,1037031,每经AI快讯，有投资者在投资者互动平台提问：请问公司目前有没有电解槽产能，规划情况能否详细介...,亿华通
1037031,1037032,依米康（SZ 300249，收盘价：10.38元）发布公告称，2023年10月12日，依米康...,"中泰证券, 依米康"
1037032,1037033,天风证券10月13日发布研报称，给予中核科技（000777.SZ，最新价：13.03元）买入...,"天风证券, 中核科技"
1037033,1037034,有投资者提问：抗癌药CPT获批后，公司是否应该按照股权协议继续收购沙东股权，适应症为MM的C...,海特生物


In [29]:
# Save the modified DataFrame to a new XLSX file
result_news_data.to_excel('result_news_data.xlsx', index=False)


In [9]:
result_news_data = pd.read_excel('result_news_data.xlsx')

# BERT-Base-Chinese-Finetuning-Financial-News-Sentiment-V2 Model

This script uses the BERT-Base-Chinese-Finetuning-Financial-News-Sentiment-V2 model (`hw2942/bert-base-chinese-finetuning-financial-news-sentiment-v2`), a BERT model fine-tuned for sentiment analysis of Chinese financial news. It loads the model and its tokenizer, moves the model to the GPU when CUDA is available, and scores the content of each news item, truncated to 512 tokens. Each item receives a binary `label`: 1 if the softmax probability of class index 2 exceeds that of class index 0, and 0 otherwise. The labels are added to the dataset as a new column.
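The labelling rule used in the cell below (class index 2 versus class index 0) can be checked without loading the model. The following pure-Python sketch uses invented logits and hypothetical helper names (`softmax`, `label_from_logits`); the notebook does not state what each class index means, so the indices are treated as opaque here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Return 1 if class index 2 is more probable than class index 0, else 0."""
    probs = softmax(logits)
    return int(probs[2] > probs[0])


# Invented three-class logits, for illustration only.
examples = [[2.0, 0.1, -1.0], [-0.5, 0.3, 1.2], [0.4, 3.0, 0.9]]
labels = [label_from_logits(x) for x in examples]

# Softmax preserves the ordering of the logits, so comparing raw logits
# yields the same labels.
shortcut = [int(x[2] > x[0]) for x in examples]

print(labels)
print(shortcut)
```

The third example shows that the rule ignores class index 1 entirely: the item is labelled 1 even though index 1 has the highest score.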

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import pandas as pd

# Load the pretrained BERT model and tokenizer
model_name = "hw2942/bert-base-chinese-finetuning-financial-news-sentiment-v2"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Check CUDA availability and move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Load news data
result_news_data_with_sentiment = pd.read_excel('result_df.xlsx')
# result_news_data = pd.read_excel('result_news_data.xlsx')

# Perform sentiment analysis
def sentiment_analysis(text):
    """Return 1 if the class-2 probability exceeds the class-0 probability, else 0."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():  # inference only; run the model once per text
        probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
    return int(probs[0, 2].item() > probs[0, 0].item())

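# Apply sentiment analysis to the 'NewsContent' column and add the results as a new 'label' column.
# Read the text from the same DataFrame that receives the labels so the rows stay aligned.
result_news_data_with_sentiment['label'] = [
    sentiment_analysis(content) for content in result_news_data_with_sentiment['NewsContent']
]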


In [None]:
result_news_data_with_sentiment

Unnamed: 0,NewsID,NewsContent,Explicit_Company,label
0,1,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣...,建设银行,0
1,2,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上...,农业银行,1
2,3,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段...,"外运发展, 中国国航",1
3,4,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿...,胜利股份,1
4,8,由于全球最大的俄罗斯Uralkaly钾矿被淹，产量大减，同时满洲里口岸铁路在修复线，导致...,冠农股份,1
...,...,...,...,...
187293,1037007,10月13日，今日共有43只涨停股，5只跌停股。其中，涨停股主要集中在华为概念股、减肥药概念...,"模塑科技, 龙版传媒, 莎普爱思, 光洋股份, 通化金马, 圣龙股份, 通宇通讯, 欧菲光",0
187294,1037009,吉电股份10月13日在交易所互动平台中披露，截至10月10日公司股东户数为171303户，较...,吉电股份,0
187295,1037025,10月12日晚间，三星医疗发布2023年前三季度业绩预告，公司预计前三季度实现归属于母公司所...,三星医疗,1
187296,1037030,每经AI快讯，有投资者在投资者互动平台提问：公司领导，请问公司经营是不是出现重大问题了，股票...,亿华通,0


In [12]:
# Save the modified DataFrame to a new XLSX file
result_news_data_with_sentiment.to_excel('result_df_with_sentiment.xlsx', index=False)