## Using DeepSeek-Chat to Extract Corporate Vulnerabilities from the Risk Factors Report

**Chunking by Natural Language Toolkit (NLTK)**

In [20]:
from langchain.text_splitter import NLTKTextSplitter
import os
import transformers
from openai import OpenAI
from dotenv import load_dotenv
import pandas as pd
import dask.bag as db

load_dotenv()

# `text` is assumed to already hold the Risk Factors section; to load it from disk instead:
# with open("/Users/xenialu/Chengdu80_prototype/report_data_fetched/AAPL_Risk_Factors.txt", "r") as file:
#     text = file.read()

# base_url="https://api.deepseek.com"
api_key = os.getenv('DEEPSEEK_API_KEY')  
# client = OpenAI(api_key=api_key, base_url=base_url)

# chunking through Natural Language Toolkit (nltk)
def chunk_text_nltk(text):
    text_splitter = NLTKTextSplitter()
    docs = text_splitter.split_text(text)
    return docs

chunk_text_nltk(text)

['Item 1A.\n\nRisk Factors\nThe Company’s business, reputation, results of operations, financial condition and stock price can be affected by a number of factors, whether currently known or unknown, including those described below.\n\nWhen any one or more of these risks materialize from time to time, the Company’s business, reputation, results of operations, financial condition and stock price can be materially and adversely affected.\n\nBecause of the following factors, as well as other factors affecting the Company’s results of operations and financial condition, past financial performance should not be considered to be a reliable indicator of future performance, and investors should not use historical trends to anticipate results or trends in future periods.\n\nThis discussion of risk factors contains forward-looking statements.\n\nThis section should be read in conjunction with Part II, Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of Operations” 
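
To make the chunking step less of a black box, here is a minimal sketch of the general idea: split the text into sentences, then greedily pack sentences into chunks up to a size limit. It is a simplified stand-in, not LangChain's implementation. The function name `pack_sentences`, the regex sentence splitter (used here in place of NLTK's tokenizer), and the `chunk_size` and `separator` defaults are assumptions chosen for illustration, and chunk overlap is left out.

```python
import re

def pack_sentences(text, chunk_size=4000, separator="\n\n"):
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    # Naive sentence boundary: whitespace following '.', '!' or '?'
    # (an assumption standing in for NLTK's sentence tokenizer).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        # Cost of adding this sentence, including the separator if the chunk is non-empty.
        extra = len(sentence) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # Chunk is full: emit it and start a new one with this sentence.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks

sample = "Risk one is supply. Risk two is legal! Is risk three cyber? Yes."
demo_chunks = pack_sentences(sample, chunk_size=40)
print(demo_chunks)
```

A sentence longer than `chunk_size` ends up as its own oversized chunk in this sketch, so a long Risk Factors paragraph with few sentence breaks can still exceed the limit.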

**Define DeepSeek API Call**

In [30]:
def query_deepseek(message, api_key):
    # DeepSeek exposes an OpenAI-compatible endpoint, so the OpenAI client is reused
    base_url = "https://api.deepseek.com"
    client = OpenAI(api_key=api_key, base_url=base_url)

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a helpful assistant with expertise in corporate risk analysis."},
            {"role": "user", "content": message},
        ],
        stream=False,
    )
    # keep only the text of the first completion
    result = response.choices[0].message.content
    return result


**Using Dask for Batch Processing**

In [43]:
# concurrently extract key corporate vulnerabilities in each chunk with DeepSeek-Chat
def deepseek_by_chunks(chunks):
    b = db.from_sequence(chunks, npartitions=len(chunks))  # Adjust npartitions depending on your data size and available cores
    processed_bag = b.map(lambda x: query_deepseek(x, api_key))
    processed_sentences = processed_bag.compute() 
    print(f"reduced text from {len(''.join(chunks))} to {len(''.join(processed_sentences))} characters")
    return processed_sentences

# cv is short for corporate vulnerabilities
# takes about 1-2 minutes to run
def extract_cv_by_chunks(text):
    print("Corporate Vulnerabilities Extraction in Progress ...")
    sentences = chunk_text_nltk(text)
    prompt = "You are given a portion of the Risk Factors section of a company's 10-K Annual Report. Please extract, concisely and accurately, the key corporate vulnerabilities described that could be useful for future risk analysis. Only output the extraction result. Here is the report segment: \n"
    messages = [prompt + sentence for sentence in sentences]
    # join with newlines so the extractions from adjacent chunks do not run together
    return "\n".join(deepseek_by_chunks(messages))

# takes about 1-2 minutes to run
def summarise_cv_by_risk_type(processed_text):
    print("Summarising Corporate Vulnerabilities by Risk Type ...")
    risk_types = ["Operational Risk", "Legal Risk", "Loan Risk", "Other Risk besides operational, legal, and loan"]
    prompts = [f"You are given a corporate vulnerabilities description. Please summarise the most relevant vulnerabilities for future risk analysis on **{risk_type}**. Only output the extraction result. Here is the report: \n" for risk_type in risk_types]
    messages = [prompt + processed_text for prompt in prompts]
    # only four prompts, so they are sent sequentially rather than through Dask
    processed_messages = [query_deepseek(message, api_key) for message in messages]

    # map each risk type to its summary (both lists share the same order)
    result_dict = dict(zip(risk_types, processed_messages))
    print("Summarising Corporate Vulnerabilities by Risk Type Completed")
    return result_dict

processed_text = extract_cv_by_chunks(text)
result_dict = summarise_cv_by_risk_type(processed_text)



Summarising Corporate Vulnerabilities by Risk Type ...
Summarising Corporate Vulnerabilities by Risk Type Completed
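
The API calls above spend most of their time waiting on the network, so a plain thread pool from the standard library is a possible lighter-weight alternative to Dask for this step. This is a sketch under assumptions, not the notebook's method: `map_concurrently`, `max_workers` and the `fake_query` stand-in are hypothetical, and the stand-in replaces the real request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def map_concurrently(func, items, max_workers=8):
    """Apply func to every item using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))

# Hypothetical stand-in for the API call (demo only).
def fake_query(message):
    return message.upper()

demo_results = map_concurrently(fake_query, ["supply chain", "litigation", "credit"])
print(demo_results)
```

With the real function the call would look like `map_concurrently(lambda m: query_deepseek(m, api_key), messages)`. Capping `max_workers` also limits how many requests are in flight at once, which may help with API rate limits.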


In [44]:
result_dict

{'Operational Risk': '1. **International Trade Restrictions**: Tariffs, import/export controls, and geopolitical tensions could disrupt operations, supply chains, and sales distribution.\n2. **Supply Chain Vulnerability**: Outsourcing manufacturing exposes the company to risks from trade restrictions and geopolitical tensions.\n3. **Natural Disasters and Climate Change**: Operations in disaster-prone regions increase risk of manufacturing and supply chain disruptions.\n4. **Cybersecurity and Ransomware Attacks**: Threats of cyberattacks and ransomware could disrupt operations and service offerings.\n5. **Public Health Issues**: Pandemics like COVID-19 can interrupt business operations, manufacturing, and supply chains.\n6. **Single or Limited Source Dependencies**: Reliance on single or limited sources for critical components increases vulnerability to business interruptions.\n7. **Industrial Accidents**: Risks of accidents at suppliers and contract manufacturers could lead to signific
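
The raw dictionary is hard to read because each value is a long string with embedded `\n` characters. A small helper can render it as one text report or serialise it for later use. This is a sketch: `render_report` and the `demo` dictionary are hypothetical and only mimic the shape of `result_dict`.

```python
import json

def render_report(results):
    """Turn {risk_type: summary} into one text report with a heading per risk type."""
    sections = [
        f"## {risk_type}\n\n{summary.strip()}"
        for risk_type, summary in results.items()
    ]
    return "\n\n".join(sections)

# Made-up data with the same shape as result_dict (demo only).
demo = {
    "Operational Risk": "1. Supply chain\n2. Cybersecurity\n",
    "Legal Risk": "1. Antitrust",
}
report = render_report(demo)
as_json = json.dumps(demo, ensure_ascii=False, indent=2)  # serialised form for saving
print(report)
```

Calling `print(render_report(result_dict))` in the notebook would show each risk type under its own heading with the numbered list on separate lines.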

---