# DSAA 5002 - Data Mining and Knowledge Discovery in Data Science

# Task 1 (50 marks) Data Preprocessing and Analysis

**Background:
Assume you are a sentiment analyst at a securities firm. Your task is to assess the impact of each news article on the A-share listed companies it explicitly mentions. For instance, on October 14, 2022, the China Securities Journal (中国证券报) reported the following:**

## Q1. Data Preprocessing - Noise Removal
### 1. Company Name Processing and Trigger Words Generation

In [2]:
!pip install jieba

Collecting jieba
  Downloading jieba-0.42.1.tar.gz (19.2 MB)

In [3]:
import json
import jieba
import re

# Read in the JSON File of "A_share Company Information"
with open("D:\\ProjectHub\\Jupyter Notebook\\DSAA 5002 DM\\DM-Project\\News_input\\A_share_list\\A_share_list.json", "r", encoding="utf-8") as file:
    a_share_list = json.load(file)

# Clean each company name and derive a short alias ("abbreviation") for it
for company in a_share_list:
    # 1. Share-name processing: strip Latin letters (e.g. "ST") and special characters
    company["name"] = re.sub(r'[a-zA-Z@#$%^&*()！~\[\]{};:,.<>?/\\|]', '', company["name"])

    # 2. Full-name simplification: drop the legal suffix "股份有限公司" if present
    fullname = company["fullname"]
    partname = fullname.replace("股份有限公司", "")
    company["partname"] = partname

    # 3. Alias (abbreviation) generation
    # Strip bracketed parts in half-width or full-width brackets, e.g. "（苏州）"
    partname = re.sub(r'[(（][^)）]+[)）]', '', partname)

    if len(company["partname"]) <= 3:
        company["abbreviation"] = company["partname"]
    else:
        # The simplified full name is still long: tokenize it with jieba
        words = jieba.lcut(partname)
        if len(words) <= 1:
            # Nothing to split off: keep the simplified full name
            company["abbreviation"] = company["partname"]
        elif len(words[0]) >= 2 and words[0][:2] != company["name"][:2]:
            # The first token (typically a place name) does not open the share name: drop it
            company["abbreviation"] = "".join(words[1:])
        elif len(words[0]) == 1 and words[0] + words[1][:1] != company["name"][:2]:
            # A one-character first token plus the next character does not open the share name:
            # drop the first two tokens
            company["abbreviation"] = "".join(words[2:])
        else:
            company["abbreviation"] = "".join(words)

    # If the abbreviation is at most 3 characters long, fall back to the share name
    if len(company["abbreviation"]) <= 3:
        # Latin letters (including "ST" / "*ST") and special characters were already
        # stripped from the share name in step 1; here "股份" is removed as well
        abbreviation_correct = company["name"].replace("股份", "")
        if abbreviation_correct not in company["abbreviation"]:
            company["abbreviation"] = abbreviation_correct
        
    print(f'{company["name"]}:{company["partname"]} or {company["abbreviation"]}')
        

# Write the enriched data to a new JSON file
with open("D:\\ProjectHub\\Jupyter Notebook\\DSAA 5002 DM\\DM-Project\\News_input\\A_share_list\\new_A_share_list.json", "w", encoding="utf-8") as new_file:
    json.dump(a_share_list, new_file, ensure_ascii=False, indent=2)

print("New JSON file created with 'partname' and 'abbreviation' fields.")


Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\HUAWEI\AppData\Local\Temp\jieba.cache
Loading model cost 0.496 seconds.
Prefix dict has been built successfully.


邵阳液压:邵阳维克液压 or 邵阳维克液压
同益中:北京同益中新材料科技 or 同益中新材料科技
华瓷股份:湖南华联瓷业 or 华联瓷业
鸿富瀚:深圳市鸿富瀚科技 or 鸿富瀚科技
高铁电气:中铁高铁电气装备 or 高铁电气装备
严牌股份:浙江严牌过滤技术 or 严牌过滤技术
百胜智能:江西百胜智能科技 or 百胜智能科技
青岛食品:青岛食品 or 青岛食品
德昌股份:宁波德昌电机 or 德昌电机
中自科技:中自环保科技 or 中自环保科技
富吉瑞:北京富吉瑞光电科技 or 富吉瑞光电科技
新瀚新材:江苏新瀚新材料 or 新瀚新材料
春雪食品:春雪食品集团 or 春雪食品集团
孩子王:孩子王儿童用品 or 孩子王儿童用品
丽臣实业:湖南丽臣实业 or 丽臣实业
珠海冠宇:珠海冠宇电池 or 珠海冠宇电池
百普赛斯:北京百普赛斯生物科技 or 百普赛斯生物科技
多瑞医药:西藏多瑞医药 or 多瑞医药
亚康股份:北京亚康万玮信息技术 or 亚康万玮信息技术
凯盛新材:山东凯盛新材料 or 凯盛新材料
中国能建:中国能源建设 or 中国能源建设
大地海洋:杭州大地海洋环保 or 大地海洋环保
华尔泰:安徽华尔泰化工 or 华尔泰化工
中捷精工:江苏中捷精工科技 or 中捷精工科技
星华反光:杭州星华反光材料 or 星华反光材料
君亭酒店:浙江君亭酒店管理 or 君亭酒店管理
纽威数控:纽威数控装备（苏州） or 纽威数控装备
上海港湾:上海港湾基础建设（集团） or 上海港湾基础建设
显盈科技:深圳市显盈科技 or 显盈科技
万事利:杭州万事利丝绸文化 or 万事利丝绸文化
开勒股份:开勒环境科技（上海） or 开勒环境科技
力量钻石:河南省力量钻石 or 力量钻石
海锅股份:张家港海锅新能源装备 or 海锅新能源装备
金三江:金三江（肇庆）硅材料 or 金三江硅材料
兰卫医学:上海兰卫医学检验所 or 兰卫医学检验所
匠心家居:常州匠心独具智能家居 or 匠心独具智能家居
禾信仪器:广州禾信仪器 or 禾信仪器
振华新材:贵州振华新材料 or 振华新材料
本立科技:浙江本立科技 or 本立科技
上海艾录:上海艾录包装 or 上海艾录包装
维远股份:利华益维远化学 or 维远化学
卓锦股份:浙江卓锦环保科技 or 卓锦环保科技
中兰环保:中兰环保科技 or 中兰环保科技


New JSON file created with 'partname' and 'abbreviation' fields.
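
The string-level steps of the cell above (share-name cleaning, removal of the legal suffix, and bracket stripping) can be tried on a single record without jieba. The following is a minimal sketch, not the notebook's own code: the helper names `clean_share_name` and `simplify_fullname` are hypothetical, `"*ST示例"` is a made-up share name used only for illustration, and the full name is the one from the output above, assumed to carry the "股份有限公司" suffix.

```python
import re

LEGAL_SUFFIX = "股份有限公司"

def clean_share_name(name):
    """Strip Latin letters and special characters, so an "ST" or "*ST" prefix vanishes."""
    return re.sub(r'[a-zA-Z@#$%^&*()！~\[\]{};:,.<>?/\\|]', '', name)

def simplify_fullname(fullname):
    """Return (partname, partname without bracketed parts) for a full company name."""
    partname = fullname.replace(LEGAL_SUFFIX, "")
    # Brackets may be half-width "()" or full-width "（）"
    stripped = re.sub(r'[(（][^)）]+[)）]', '', partname)
    return partname, stripped

print(clean_share_name("*ST示例"))  # made-up share name
print(simplify_fullname("纽威数控装备（苏州）股份有限公司"))
```

Note that the same character class removes every Latin letter, so a share name whose brand itself contains Latin letters would lose them too and could collapse to a very generic word.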


In [4]:
def create_company_lookup(company_list):
    # Collect the trigger words in a set so that duplicates are dropped
    company_set = set()
    for company in company_list:
        # Adding all three name forms would work, but a form that contains another
        # form is redundant for substring search, so use the inclusion relationships
        # to keep the set small.
        # Keep the simplified full name only if it contains neither of the shorter forms
        if company["abbreviation"] not in company["partname"] and company["name"] not in company["partname"]:
            company_set.add(company["partname"])
        # Of abbreviation and share name, keep the contained (shorter) one;
        # if neither contains the other, keep both
        if company["abbreviation"] in company["name"]:
            company_set.add(company["abbreviation"])
        elif company["name"] in company["abbreviation"]:
            company_set.add(company["name"])
        else:
            company_set.add(company["abbreviation"])
            company_set.add(company["name"])

    print("The number of company name: {}".format(len(company_list)))
    print("The number of company name to be searched: {}".format(len(company_set)))
    return company_set

# Create the Trigger Words Set of A_share Company name
with open("D:\\ProjectHub\\Jupyter Notebook\\DSAA 5002 DM\\DM-Project\\News_input\\A_share_list\\new_A_share_list.json", "r", encoding="utf-8") as file:
    a_share_list = json.load(file)
company_set = create_company_lookup(a_share_list)


The number of company name: 4654
The number of company name to be searched: 5959
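
The inclusion rule in `create_company_lookup` compares only the name forms of one company at a time. Because the filter in the next step only asks whether any trigger word occurs in the text, the same idea could in principle be applied across the whole set: a word that contains a shorter trigger word can never change the keep/drop outcome. The sketch below illustrates this under that assumption; `prune_supersets` is a hypothetical helper, not part of the notebook, and the sample words are name forms of one company from the output above.

```python
def prune_supersets(words):
    """Drop every word that contains another, shorter word from the same collection."""
    kept = []
    # Visit short words first so that every possible "container" is seen after its parts
    for word in sorted(set(words), key=len):
        if not word:
            continue  # an empty string is a substring of everything, so never keep it
        if not any(short in word for short in kept):
            kept.append(word)
    return set(kept)

sample = {"德昌股份", "德昌电机", "宁波德昌电机"}
print(prune_supersets(sample))
```

This only preserves the keep/drop decision. If a later step needs to know which company was mentioned, the longer forms still carry information and should not be discarded.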


### 2. Search by Rules of Trigger Words

In [6]:
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import threading

# Global lock so that appends to the shared lists are thread-safe
lock = threading.Lock()

# Process a single row
def process_row(row, company_set, result_list, drop_list):
    title = row["Title"]
    news_content = row["NewsContent"]

    # Rows with a missing title or content are skipped and end up in neither list
    if pd.notna(title) and pd.notna(news_content):
        # Combine the four columns of this row into one record
        combined_row = {
            "NewsID": row["NewsID"],
            "Title": title,
            "NewsContent": news_content,
            "NewsSource": row["NewsSource"]
        }
        if any(company in title or company in news_content for company in company_set):
            with lock:
                result_list.append(combined_row)  # keep: at least one trigger word matched
        else:
            with lock:
                drop_list.append(combined_row)  # drop: no trigger word matched

# Main processing function: handles rows start .. end-1
def process_data(start, end, full_data, company_set, result_list, drop_list):
    for i in range(start, end):
        if i % 1000 == 0:
            print("{} rows have been done".format(i))
            print("--- ")
        process_row(full_data.iloc[i], company_set, result_list, drop_list)
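
The rule applied by `process_row` is a plain substring test against every trigger word. As an alternative (a swapped-in technique, not what the notebook does), all trigger words can be compiled into one regular-expression alternation so that each text is scanned with a single pattern; the keep/drop decision is the same. This is a minimal sketch on plain strings: `build_matcher` and `mentions_company` are hypothetical names, and the headlines are made up for illustration.

```python
import re

def build_matcher(trigger_words):
    """Compile all non-empty trigger words into one alternation pattern."""
    escaped = [re.escape(word) for word in trigger_words if word]
    if not escaped:
        return re.compile(r"(?!)")  # a pattern that never matches
    return re.compile("|".join(escaped))

def mentions_company(title, content, matcher):
    """True if the title or the content contains at least one trigger word."""
    return bool(matcher.search(title) or matcher.search(content))

matcher = build_matcher({"青岛食品", "孩子王"})
print(mentions_company("青岛食品发布公告", "正文", matcher))  # made-up headline
print(mentions_company("宏观经济综述", "正文", matcher))      # made-up headline
```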

In [8]:
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2


In [9]:
# Read the Excel file
full_data = pd.read_excel("D:\\ProjectHub\\Jupyter Notebook\\DSAA 5002 DM\\DM-Project\\News_input\\News_test.xlsx")

# Split the data into chunks, one per thread
num_threads = 21  # number of threads
chunk_size = len(full_data) // num_threads

threads = []
result_list = []
drop_list= []
print("Processing data with {} threads...".format(num_threads))

for i in range(num_threads):
    start = i * chunk_size
    end = (i + 1) * chunk_size if i < num_threads - 1 else len(full_data)
    thread = threading.Thread(target=process_data, args=(start, end, full_data, company_set, result_list, drop_list))
    threads.append(thread)
    print("Thread {} is processing rows {} to {}...".format(i + 1, start, end))
    
for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

print("All threads have finished processing.\n")

print("Before selected by Rule, number of news is: {}".format(len(full_data)))
print("After selected by Rule, number of news is: {}".format(len(result_list)))

# Collect the results into DataFrames
result_df = pd.DataFrame(result_list)
drop_df = pd.DataFrame(drop_list)


Processing data with 21 threads...
Thread 1 is processing rows 0 to 71...
Thread 2 is processing rows 71 to 142...
Thread 3 is processing rows 142 to 213...
Thread 4 is processing rows 213 to 284...
Thread 5 is processing rows 284 to 355...
Thread 6 is processing rows 355 to 426...
Thread 7 is processing rows 426 to 497...
Thread 8 is processing rows 497 to 568...
Thread 9 is processing rows 568 to 639...
Thread 10 is processing rows 639 to 710...
Thread 11 is processing rows 710 to 781...
Thread 12 is processing rows 781 to 852...
Thread 13 is processing rows 852 to 923...
Thread 14 is processing rows 923 to 994...
Thread 15 is processing rows 994 to 1065...
Thread 16 is processing rows 1065 to 1136...
Thread 17 is processing rows 1136 to 1207...
Thread 18 is processing rows 1207 to 1278...
Thread 19 is processing rows 1278 to 1349...
Thread 20 is processing rows 1349 to 1420...
Thread 21 is processing rows 1420 to 1500...
0 rows have been done
--- 
1000 rows have been done
--- 
All t
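
The chunk arithmetic above (integer division, with the last chunk absorbing the remainder) explains why thread 21 handles rows 1420 to 1500 while the others handle 71 rows each. The sketch below reproduces those bounds and shows one possible variant that returns per-chunk results from `ThreadPoolExecutor.map` instead of appending to shared lists under a lock. The names `chunk_bounds` and `flag_rows` are hypothetical, and the sample texts are made up. In CPython the global interpreter lock keeps threads from running this pure-Python matching in parallel, so the split is mainly a way to organize the work.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n_rows, n_chunks):
    """Split range(n_rows) into contiguous (start, end) pairs; the last one takes the remainder."""
    size = n_rows // n_chunks
    return [(i * size, (i + 1) * size if i < n_chunks - 1 else n_rows)
            for i in range(n_chunks)]

def flag_rows(texts, trigger_words, n_chunks=4):
    """Return one boolean per text, in input order: does it contain any trigger word?"""
    def work(bounds):
        start, end = bounds
        return [any(word in text for word in trigger_words) for text in texts[start:end]]

    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() yields the chunk results in submission order, so no lock is needed
        parts = pool.map(work, chunk_bounds(len(texts), n_chunks))
        return [flag for part in parts for flag in part]

bounds = chunk_bounds(1500, 21)
print(bounds[0], bounds[-1])  # (0, 71) (1420, 1500)
print(flag_rows(["青岛食品公告", "无关新闻", "孩子王开店"], {"青岛食品", "孩子王"}, n_chunks=2))
```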

In [6]:
# Save the kept and the dropped rows to new Excel files
result_df.to_excel("D:\\ProjectHub\\Jupyter Notebook\\DSAA 5002 DM\\DM-Project\\News_output\\filtered_News_test.xlsx", index=False)
drop_df.to_excel("D:\\ProjectHub\\Jupyter Notebook\\DSAA 5002 DM\\DM-Project\\News_output\\droped_News_test.xlsx", index=False)