# **Governance Database Extraction and Preprocessing**

The Governance Database project retrieves data directly from OPENDART to build a database of KOSPI-listed corporations and their executives, based on posted disclosures. 

For stability, information is pulled directly from the OPENDART API where possible. Almost all OPENDART endpoints require a separate API call per corporation or report request, which is reflected in the total execution time. Only one function relies on OpenDartReader (to search for direct links to the audit committee information used for the governance check). [OPENDART API limits](https://engopendart.fss.or.kr/cop/bbs/selectArticleDetail.do) are as follows: 
- Individual: 20,000 calls a day (the limit is for all 83 API services and not by service)
- Corporation (business registration and registered IP)
    - 2 services ("Search disclosures" and "Overview of corporate status"): Unlimited
    - 81 services (excluding "Search disclosures" and "Overview of corporate status"): 20,000 calls a day (the limit is for all 81 API services and not by service)
- 1,000 calls per minute
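
The per-minute cap translates into a minimum spacing between requests, which is why the extraction functions in this notebook pause for 0.07 seconds after each call. A minimal sketch of that arithmetic follows; `min_interval` and `days_needed` are hypothetical helpers, not part of the notebook.

```python
import math

CALLS_PER_MINUTE = 1_000   # per-minute cap quoted above
CALLS_PER_DAY = 20_000     # daily cap for an individual key


def min_interval(calls_per_minute: int) -> float:
    """Smallest delay in seconds between calls that stays within the per-minute cap."""
    return 60.0 / calls_per_minute


def days_needed(total_calls: int, calls_per_day: int = CALLS_PER_DAY) -> int:
    """Number of days a batch of calls needs under the daily cap."""
    return math.ceil(total_calls / calls_per_day)


print(min_interval(CALLS_PER_MINUTE))  # 0.06
```

A sleep of 0.07 seconds sits just above this floor, leaving a small margin for timing jitter.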

### Environment Set Up 
Runs on Python 3.13.5. The cell below checks that the current kernel is running Python 3.13.5 or later and raises an error otherwise. To set up a virtual environment on Windows, execute the following lines in the terminal: 

    py -3.13 -m venv virtualenv

    virtualenv\Scripts\activate

In [1]:
import sys
assert sys.version_info >= (3, 13, 5)

### Outputs

By the end of the project, the following two databases will be produced: 
1. **executive_df**, providing details on the 15k+ listed executives, including information such as registered officer status, shareholder relations, salary, and professional experience.

2. **summary_df**, a grouped dataset of corp-level information, including the number of directors by type, audit committee size, and total assets from the past three years (used to determine the audit committee mandate).

*See the README.md file for reference.*

<br> 
The cell below checks for the necessary folders. If it returns False, create a folder (at the same directory level as notebooks, not within) labeled 'data' and within it, two subfolders: 'raw' and 'processed'.

In [2]:
import os

data_dir = os.path.join(r'..\..', 'data')
raw_dir = os.path.join(data_dir, 'raw')
processed_dir = os.path.join(data_dir, 'processed')

print(os.path.isdir(raw_dir) and os.path.isdir(processed_dir))

True
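
If the check prints False, the folders can also be created programmatically rather than by hand. The sketch below assumes a hypothetical helper `ensure_data_dirs` and takes the project root as an argument, so it does not depend on the notebook's relative path.

```python
from pathlib import Path


def ensure_data_dirs(project_root: Path) -> tuple:
    """Create data/raw and data/processed under project_root if they are missing."""
    raw = project_root / "data" / "raw"
    processed = project_root / "data" / "processed"
    for folder in (raw, processed):
        # parents=True also creates the intermediate 'data' folder
        folder.mkdir(parents=True, exist_ok=True)
    return raw, processed
```

Calling `ensure_data_dirs(Path("..") / "..")` from the notebook's folder would mirror the layout checked above.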


### Packages 

Update **API_key**, **bsns_year**, and **reprt_code** as needed.

OPENDART reprt_code: 
- First Quarterly Report : 11013
- Semi-annual Report : 11012
- Third Quarterly Report : 11014
- Annual Report : 11011
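
To avoid mistyping the numeric codes, they can be looked up by label. This is only a sketch: the short labels (`Q1`, `H1`, `Q3`, `ANNUAL`) and the helper `resolve_reprt_code` are illustrative names, not OPENDART terminology.

```python
REPORT_CODES = {
    "Q1": "11013",      # First Quarterly Report
    "H1": "11012",      # Semi-annual Report
    "Q3": "11014",      # Third Quarterly Report
    "ANNUAL": "11011",  # Annual Report
}


def resolve_reprt_code(label: str) -> str:
    """Return the OPENDART reprt_code for a short label, case-insensitively."""
    try:
        return REPORT_CODES[label.upper()]
    except KeyError:
        raise ValueError(f"Unknown report label: {label!r}") from None


print(resolve_reprt_code("annual"))  # 11011
```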


In [None]:
import os
import io
import re
import time
import zipfile 

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

from datetime import datetime
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor 

import dart_fss
import OpenDartReader

API_key = 'YOUR_OPENDART_API_KEY' # replace with your own key; avoid committing real keys

dart = OpenDartReader(API_key)
dart_fss.set_api_key(API_key)

bsns_year = '2024'
reprt_code = '11011'
reference_date = datetime(2025, 8, 12) # used for tenure calculation

### **Data Extraction (data_extraction_ipynb)**

Produces all the raw data files necessary for preprocessing. All 7 csv files are saved in the raw data folder.

In [4]:
BASE_DATA_DIR = os.path.join(r'..\..', 'data', 'raw')
os.makedirs(BASE_DATA_DIR, exist_ok=True)
print(f"Raw data directory exists at: {BASE_DATA_DIR}")

Raw data directory exists at: ..\..\data\raw


#### 0. save_df_to_csv

a helper function, called at the end of the extraction functions to save their outputs as CSV files in the raw data folder

In [5]:
def save_df_to_csv(df: pd.DataFrame, file_path: str, index: bool = False):
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    try:
        df.to_csv(file_path, index=index)
        print(f"DataFrame saved to {file_path}")
    except Exception as e:
        print(f"Error saving DataFrame to CSV {file_path}: {e}")

#### 1. get_corp_code

pulls the most up-to-date list of corp codes with a single API call that downloads a zip file from OPENDART. Corp codes are unique reference codes assigned by OPENDART, distinct from stock codes, and are required keys for pulling full company (2. get_kospi_company_info) and executive (3. get_executive_status_data) info. 

[OPENDART | Guide for Developers to Corporation code](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE001&apiId=AE00004)

Required Key: 

- crtfc_key (API key)
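
The listed-company filter can be tried offline before spending an API call. The sketch below parses a hand-written sample that mimics the `list` entries of CORPCODE.xml; the root element name and all values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the <list> children mirror what get_corp_code reads.
SAMPLE_XML = """<result>
  <list>
    <corp_code>00000001</corp_code>
    <corp_name>Listed Co</corp_name>
    <stock_code>123456</stock_code>
  </list>
  <list>
    <corp_code>00000002</corp_code>
    <corp_name>Unlisted Co</corp_name>
    <stock_code> </stock_code>
  </list>
</result>"""


def listed_corps(xml_text: str) -> list:
    """Keep only entries whose stock_code is a 6-character code."""
    root = ET.fromstring(xml_text)
    rows = []
    for corp in root.findall("list"):
        stock_code = (corp.findtext("stock_code") or "").strip()
        if len(stock_code) == 6:
            rows.append({
                "corp_code": corp.findtext("corp_code"),
                "corp_name": corp.findtext("corp_name"),
                "stock_code": stock_code,
            })
    return rows


print(listed_corps(SAMPLE_XML))
```

Unlisted entries carry a blank stock code, so the length check is enough to separate them.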

In [6]:
def get_corp_code(api_key: str, output_dir: str = BASE_DATA_DIR) -> pd.DataFrame:
    url_code = f'https://opendart.fss.or.kr/api/corpCode.xml?crtfc_key={api_key}'
    response = requests.get(url_code) 
    response.raise_for_status() # raise HTTPError for bad responses

    # create the extraction directory if it does not exist
    os.makedirs('dart_data', exist_ok=True)

    # unzip and extract CORPCODE.xml
    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        z.extractall('dart_data')
        xml_path = os.path.join('dart_data', 'CORPCODE.xml')

    # parse XML
    tree = ET.parse(xml_path)
    root = tree.getroot()

    # filter for listed companies (6-digit stock code only) and append to df
    corp_list = []
    for corp in root.findall('list'):
        stock_code = corp.findtext('stock_code')
        if stock_code and len(stock_code) == 6:
            corp_list.append({
                'corp_code': corp.findtext('corp_code'),
                'corp_name': corp.findtext('corp_name'),
                'corp_eng_name': corp.findtext('corp_eng_name'),
                'stock_code': stock_code
            })

    #save in raw data folder 
    corp_codes_df = pd.DataFrame(corp_list)
    output_filepath = os.path.join(output_dir, 'listed_corp_codes.csv')
    save_df_to_csv(corp_codes_df, output_filepath)
    return corp_codes_df

In [7]:
all_corp_codes_df = get_corp_code(API_key)

DataFrame saved to ..\..\data\raw\listed_corp_codes.csv


#### 2. get_kospi_company_info

passes in the list of corp codes from get_corp_code, filters for KOSPI-listed corporations, and fetches detailed company info. OPENDART requires an individual call for each *corp_code*, resulting in a total execution time of ~9 minutes for 3,000+ calls.

[OPENDART | Guide for Developers to Overview of corporate status](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE001&apiId=AE00002)

Required Keys: 
- crtfc_key (API key) 
- corp_code 

Kept Data: 

| Key | Name |
|-----|------|
| corp_name | Formal name |
| corp_code | Corporation code |
| stock_code | Stock item code |
| ceo_nm | Representative name |
| induty_code* | Industry code |

*induty_code: not relevant now, but could be used later to compare industry norms

Dropped Data (corp code and name are sufficient for identification):
| Key  | Name |
| -------|-----|
| corp_name_eng  | English name	  | 
| stock_name | Item name 	  | 
| corp_cls  |  Corporation type   | 
| jurir_no  |  Corporate registration No.   |
| bizr_no | Business registration No.  | 
| adres | Address  | 
| hm_url  | Website URL  | 
| ir_url  | IR website  | 
| phn_no |  Telephone No.   | 
| fax_no  | Fax No.  | 
| est_dt  | Establishment date (YYYYMMDD)  | 
| acc_mt  | Month of settlement (MM)  |

In [8]:
def get_kospi_company_info(api_key: str, corp_codes_df: pd.DataFrame, output_dir: str = BASE_DATA_DIR) -> pd.DataFrame:
    data = []
    api_endpoint = "https://engopendart.fss.or.kr/engapi/company.json"

    for i, row in corp_codes_df.iterrows(): # iterate over list of corp_codes
        corp_code = row['corp_code']
        corp_name = row['corp_name']

        params = { # OPENDART required keys
            'crtfc_key': api_key,
            'corp_code': corp_code
        }
        try:
            response = requests.get(api_endpoint, params=params)
            response.raise_for_status() # raise HTTPError for bad responses 
            info = response.json()

            # filter for KOSPI companies 
            if info and info.get('corp_cls') == 'Y': # all types: Y (KOSPI), K (KOSDAQ), N (KONEX), E (Other)
                data.append({
                    'corp_name': info.get('corp_name'),
                    'corp_code': info.get('corp_code'),
                    'stock_code': info.get('stock_code'),
                    'ceo_name': info.get('ceo_nm'),
                    'industry_code': info.get('induty_code'),
                })
            time.sleep(0.07) # respect API limit
        except Exception as e:
            print(f"Failed to fetch company info for {corp_name} ({corp_code}): {e}")
            continue
    
    # save in raw data folder
    kospi_codes_df = pd.DataFrame(data)
    output_filepath = os.path.join(output_dir, 'kospi_company_info.csv')
    save_df_to_csv(kospi_codes_df, output_filepath)
    return kospi_codes_df

In [9]:
kospi_company_info_df = get_kospi_company_info(api_key=API_key, corp_codes_df=all_corp_codes_df)

DataFrame saved to ..\..\data\raw\kospi_company_info.csv


#### 3. get_executive_status_data

retrieves executive-level data for the list of KOSPI-listed corps. OPENDART requires an individual API call per corporation, so ~850 calls take ~2 minutes.
As the function iterates over the KOSPI corps, it flags any that have no executive data available.
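
The flagging logic rests on the `status` field of each JSON response: `'000'` signals success, and the notebook treats `'013'` as "no data". A minimal offline sketch, using hypothetical helpers `extract_rows` and `missing_corps` and made-up payloads:

```python
SUCCESS = "000"
NO_DATA = "013"  # status the notebook reports as "no data available"


def extract_rows(payload: dict) -> list:
    """Return the 'list' rows of a response payload, or [] when nothing usable came back."""
    if payload.get("status") == SUCCESS:
        return payload.get("list") or []
    return []


def missing_corps(payloads_by_corp: dict) -> list:
    """Corp codes whose payload produced no rows."""
    return [code for code, payload in payloads_by_corp.items() if not extract_rows(payload)]


# Made-up payloads keyed by made-up corp codes
sample = {
    "00000001": {"status": SUCCESS, "list": [{"nm": "Hong Gildong", "ofcps": "CEO"}]},
    "00000002": {"status": NO_DATA, "message": "조회된 데이타가 없습니다."},
}
print(missing_corps(sample))  # ['00000002']
```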

[OPENDART | Guide for Developers to Status of executives](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE002&apiId=AE00011)

Required Keys:
- crtfc_key (API key)
- corp_code 
- bsns_year (fiscal year)
- reprt_code
    - First Quarterly Report : 11013
    - Semi-annual Report : 11012
    - Third Quarterly Report : 11014
    - Annual Report : 11011

The resulting **executive_status_data_df** saves all the information available. In preprocessing, **exec_df** filters down based on the following: 

Kept Data: 
| Key  | Name | 
| -------|-----|
| rcept_no | Filing No.  | 
| corp_cls | Corporation type	  |
| corp_code | Corporation code	  | 
| corp_name | Corporation name	  | 
| nm | Name  |
| sexdstn | Gender  | 
| ofcps | Position  | 
| rgist_exctv_at | Registered officer status  | 
| fte_at | Full-time  | 
| chrg_job | Responsibilities  | 
| main_career | Professional Background  |
| mxmm_shrholdr_relate | Relationship to Largest Shareholder  | 
| hffc_pd | Period of employment  | 

Dropped Data: 
| Key  | Name | 
| -------|-----|
| birth_ym | Date of birth  | 
| tenure_end_on | Term expiration date  | 
| stlm_dt | Settlement date  | 

These fields are dropped because tenure in the company is sufficient for gauging expertise, and *stlm_dt* is irrelevant given the filter on year and report type.

In [10]:
def get_executive_status_data(api_key: str, kospi_codes_df: pd.DataFrame, bsns_year: str, reprt_code: str, output_dir: str = os.path.join('data', 'raw')) -> pd.DataFrame:
    results = []
    api_endpoint = "https://opendart.fss.or.kr/api/exctvSttus.json"

    for idx, row in kospi_codes_df.iterrows():
        corp_code = row['corp_code']
        corp_name = row['corp_name']
        stock_code = row['stock_code']

        params = { #OPENDART required keys
            'crtfc_key': api_key,
            'corp_code': corp_code,
            'bsns_year': bsns_year,
            'reprt_code': reprt_code
        }

        try:
            response = requests.get(api_endpoint, params=params)
            response.raise_for_status() # raise error for bad responses 
            data = response.json()

            if data['status'] == '000': # success - data found
                if 'list' in data and data['list']:
                    df = pd.DataFrame(data['list']) # appends all info, later filtered down in preprocessing
                    df['stock_code'] = stock_code
                    
                    results.append(df)
            else:
                print(f"No executive data available for {corp_name} ({corp_code}) for {bsns_year}/{reprt_code}.")

        except Exception as e:
            print(f"An unexpected error occurred for {corp_name} ({corp_code}): {e}")

        time.sleep(0.07) 

    if results:
        executive_status_df = pd.concat(results, ignore_index=True)
        output_filepath = os.path.join(output_dir, f'executive_status_{bsns_year}_{reprt_code}.csv')
        save_df_to_csv(executive_status_df, output_filepath)
        return executive_status_df
    else:
        print("\nNo executive status data was retrieved.")
        return pd.DataFrame() 

In [11]:
executive_status_data_df = get_executive_status_data(API_key, kospi_company_info_df, bsns_year, reprt_code) 

No executive data available for 미래에셋맵스 아시아퍼시픽 부동산공모 1호 투자회사 (00600013) for 2024/11011.
No executive data available for 맥쿼리한국인프라투융자회사 (00435297) for 2024/11011.
No executive data available for 한국투자ANKOR유전해외자원개발특별자산투자회사1호(지분증권) (00907013) for 2024/11011.
No executive data available for 케이비발해인프라투융자회사 (01880801) for 2024/11011.
No executive data available for 주식회사 대신밸류리츠위탁관리부동산투자회사 (01885222) for 2024/11011.
No executive data available for 대한조선 주식회사 (00182696) for 2024/11011.
DataFrame saved to data\raw\executive_status_2024_11011.csv


#### 4. get_total_assets  

makes calls to OPENDART's financial statements API to pull total assets for each corp. For each corp, if a consolidated report ('CFS') exists, it pulls information from that statement. Otherwise, it falls back on the separate report ('OFS'). The function supports pulling other FS data, so long as sj_div and sj_nm are located correctly, and *target_account_names* is updated to reflect the target key words. The full function makes at least 850 calls (one per KOSPI code, plus one more for each OFS fallback), with execution time ~3 minutes.
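
The CFS-then-OFS fallback can be isolated from the network code. The sketch below is a simplification: `first_available` and `fake_fetch` are hypothetical names, and the fallback triggers on an empty result rather than on a missing total-assets row as in the full function.

```python
from typing import Callable

FetchFn = Callable[[str], list]


def first_available(fetch: FetchFn, order=("CFS", "OFS")) -> tuple:
    """Return (fs_div, rows) for the first statement type that yields rows."""
    for fs_div in order:
        rows = fetch(fs_div)
        if rows:
            return fs_div, rows
    return None, []


def fake_fetch(fs_div: str) -> list:
    # Stand-in for the API call: this made-up company has no consolidated statement.
    canned = {
        "CFS": [],
        "OFS": [{"account_nm": "자산총계", "thstrm_amount": "1,000"}],
    }
    return canned[fs_div]


fs_div, rows = first_available(fake_fetch)
print(fs_div, rows[0]["thstrm_amount"])  # OFS 1,000
```

Passing the fetch function in as an argument keeps the ordering logic testable without an API key.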


[OPENDART | Developer Guide to Single company’s full financial statements](https://opendart.fss.or.kr/guide/detail.do?apiGrpCd=DE003&apiId=AE00036)

Required Keys:
- crtfc_key (API key)
- corp_code 
- bsns_year (fiscal year)
- reprt_code 
- fs_div (separate/consolidated report)

The resulting **assets_YYYY_REPORT** file will be used to check requirements for mandated audit committees (corporations with total assets > 2 trillion KRW). Because corporations have a two-year grace period for forming a mandated audit committee, the function pulls total assets from the past three years. 
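
As a rough illustration of the threshold test, the sketch below flags each year in which total assets exceed 2 trillion KRW. `years_above_threshold` is a hypothetical helper, and how the grace period turns these flags into a mandate decision is left to the preprocessing step and not assumed here.

```python
from typing import Optional

THRESHOLD_KRW = 2_000_000_000_000  # 2 trillion KRW


def years_above_threshold(assets_by_year: dict, threshold: int = THRESHOLD_KRW) -> dict:
    """Map each year to True when its total assets exceed the threshold.

    Missing values (None) are treated as not exceeding the threshold.
    """
    flags = {}
    for year, amount in assets_by_year.items():
        value: Optional[int] = amount
        flags[year] = value is not None and value > threshold
    return flags


# Made-up figures for a single company
print(years_above_threshold({2024: 3 * 10**12, 2023: 10**12, 2022: None}))
# {2024: True, 2023: False, 2022: False}
```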

Kept Data: 
| Key  | Name | 
| -------|-----|
| rcept_no* | Filing No.  | 
| thstrm_amount	| Term amount |
| frmtrm_amount	| Previous term amount | 
| bfefrmtrm_amount	| Amount of term before previous | 

Dropped Data: 
| Key  | Name | 
| -------|-----|
| reprt_code | Report code	  | 
| bsns_year | Fiscal year	  | 
| corp_code | Corporation code	  | 
| sj_div** | Type of financial statement	  |
| sj_nm | Financial statement title	  |
| account_id | Account ID  |
| account_nm | Account name  | 
| account_detail | Detail account  |
| thstrm_nm	| Term name  | 
| thstrm_add_amount	 | Accumulated term amount	  | 
| frmtrm_nm	| Previous term name | 
| frmtrm_q_nm | Previous term name(Quarterly/Semiannual) | 
| frmtrm_q_amount | Previous term amount(Quarterly/Semiannual) | 
| frmtrm_add_amount	| Accumulated previous term amount  | 
| bfefrmtrm_nm	| Name of term before previous | 
| ord	| Account code sort order | 
| currency	| Currency unit |

*required key for subdoc searches in preprocessing

**function already filters for sj_div = BS, can change to retrieve data from other statements 

In [12]:
def get_total_assets(kospi_company_info_df: pd.DataFrame, bsns_year: str, reprt_code: str, API_key: str, output_dir: str = os.path.join('data', 'raw')) -> pd.DataFrame:
    """
    Fetches total assets (current term and two prior terms) for each company from the OPENDART API.
    """
    api_url = 'https://opendart.fss.or.kr/api/fnlttSinglAcntAll.json'
    target_sj_div = "BS"
    target_account_names = {"자산총계", "총자산", "자산"} # checks for possible categories covering total assets 
    year = int(bsns_year)

    all_results = []

    for corp_code in kospi_company_info_df['corp_code']:
        rcept_no, assets, prior_assets, two_years_ago = None, None, None, None
        
        # try CFS first, fall back on OFS
        for fs_div in ['CFS', 'OFS']:
            params = {'crtfc_key': API_key, 'corp_code': corp_code, 'bsns_year': bsns_year, 'reprt_code': reprt_code, 'fs_div': fs_div}
            try:
                res = requests.get(api_url, params=params)
                res.raise_for_status()
                data = res.json()
                
                if data.get('status') == '000' and 'list' in data:
                    # search for assets data directly from the JSON list
                    for item in data['list']:
                        if item['sj_div'] == target_sj_div and item['account_nm'].strip().replace(' ', '') in target_account_names:
                            rcept_no = item.get('rcept_no')
                            assets = pd.to_numeric(item.get('thstrm_amount', '').replace(',', ''), errors='coerce')
                            prior_assets = pd.to_numeric(item.get('frmtrm_amount', '').replace(',', ''), errors='coerce')
                            two_years_ago = pd.to_numeric(item.get('bfefrmtrm_amount', '').replace(',', ''), errors='coerce')
                            break 
                    
                    if assets is not None:
                        break 
            
            except (requests.exceptions.RequestException, ValueError):
                continue
        
        if assets is None:
            print(f"Total Assets not found for {corp_code}.")

        all_results.append({
            'corp_code': corp_code,
            'rcept_no': rcept_no,
            f'{year}_total_assets': assets,
            f'{year - 1}_total_assets': prior_assets,
            f'{year - 2}_total_assets': two_years_ago
        })
        
        time.sleep(0.07)

    assets_df = pd.DataFrame(all_results)
    
    for y_offset in [0, 1, 2]:
        col_name = f'{year - y_offset}_total_assets'
        if col_name in assets_df.columns:
            assets_df[col_name] = pd.to_numeric(assets_df[col_name], errors='coerce').astype('Int64')

    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, f"assets_{bsns_year}_{reprt_code}.csv")
    assets_df.to_csv(output_path, index=False, encoding='utf-8-sig')

    return assets_df

In [13]:
assets_df = get_total_assets(kospi_company_info_df, bsns_year, reprt_code, API_key)

Total Assets not found for 00600013.
Total Assets not found for 00435297.
Total Assets not found for 00907013.
Total Assets not found for 01880801.
Total Assets not found for 00112998.
Total Assets not found for 01885222.
Total Assets not found for 00182696.


#### 5. get_salary_type

pulls salary data from three OPENDART source types: 
- Individual, which discloses the exact amount for executives making more than 500M KRW. 
- Grouped, which provides total annual grouped salary and average salaries by status type. 
- Unregistered, which provides the total annual grouped salary and average per person.

In the preprocessing notebook, salary is appended to each executive: the exact amount where possible and the average amount otherwise. Because there are three separate API endpoints, **get_salary_type** contains two internal helper functions: **_get_json** and **_get_salary_data_for_corp**. When the main function is called, each KOSPI *corp_code* is passed to **_get_salary_data_for_corp**, which makes a separate call to each source type using **_get_json**. All the disclosed datapoints are then appended to a larger, consolidated dataframe that is saved as **salary_separate_YYYY_REPORT** in the raw data folder and referred to within this notebook as **salary_separate_df**. If any errors occur in retrieving a corp's data, the details are printed. Each corp code makes 3 API requests, totaling ~2,550 calls and executing in ~4 minutes.

1. [OPENDART | Guide for Developers to Remuneration for individual directors and auditors](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE002&apiId=AE00013) (lists all those > 500m KRW)

2. [OPENDART | Guide for Developers to Remuneration for all directors and auditors (remuneration paid - by type)](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE002&apiId=AE00030)

3. [OPENDART | Guide for Developers to Remuneration for unregistered executives](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE002&apiId=AE00028)

Required Keys:
- crtfc_key (API key)
- corp_code 
- bsns_year (fiscal year)
- reprt_code 

The resulting **salary_separate_df** displays the salary data pulled from the three datapoints, with standardized columns for the purposes of merging with **exec_df** in the preprocessing notebook. The following chart maps the resulting **salary_separate_df**'s column names to the corresponding OPENDART source.

| salary_separate_df column  | (1.) Individual | (2.) All By Type | (3.) Unregistered | 
| -------|-----|-----|-----|
| position | ofcps | se (category)* | se	(unregistered) |
| compensation | mendng_totamt | psn1_avrg_pymntamt | jan_salary_am |
| salary_source | 개인별보수 | 임원전체보수유형 | 미등기임원 |
| salary_type | exact | estimate | estimate | 


*Category covers:

- Registered director (excluding outside directors and members of the audit committee)
- Outside director (excluding members of the audit committee)
- Member of the audit committee
- Auditors
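
The column mapping in the table can also be expressed as data, which keeps the three sources aligned in one place. This is a table-driven sketch under stated assumptions: `FIELD_MAP` and `normalize` are hypothetical names, and the sample rows are made up.

```python
FIELD_MAP = {
    "individual":   {"position": "ofcps", "compensation": "mendng_totamt",      "salary_type": "exact"},
    "grouped":      {"position": "se",    "compensation": "psn1_avrg_pymntamt", "salary_type": "estimate"},
    "unregistered": {"position": "se",    "compensation": "jan_salary_am",      "salary_type": "estimate"},
}


def normalize(source: str, row: dict, corp_code: str) -> dict:
    """Translate one API row into the standardized salary columns."""
    spec = FIELD_MAP[source]
    return {
        "corp_code": corp_code,
        # only the individual disclosure names a person
        "name": row.get("nm", "") if source == "individual" else "",
        "position": row.get(spec["position"]),
        "compensation": row.get(spec["compensation"]),
        "salary_type": spec["salary_type"],
    }


# Made-up rows, one per source
print(normalize("individual", {"nm": "Hong Gildong", "ofcps": "CEO", "mendng_totamt": "900"}, "00000001"))
print(normalize("grouped", {"se": "Outside director", "psn1_avrg_pymntamt": "80"}, "00000001"))
```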

In [14]:
def get_salary_type(kospi_company_info_df: pd.DataFrame, bsns_year: str, reprt_code: str, API_key: str, output_dir: str = os.path.join('data', 'raw')) -> pd.DataFrame:
    # === Internal Helper Function: DART API JSON request ===
    def _get_json(url, corp_code):
        """Helper to fetch JSON data from DART API and handle errors."""
        params = {
            'crtfc_key': API_key,
            'corp_code': corp_code,
            'bsns_year': bsns_year,
            'reprt_code': reprt_code
        }
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
            
            if data.get('status') != '000' or 'list' not in data:
                if data.get('status') != '000':
                    print(f"OPENDART Error for {corp_code}: {data.get('message')}")
                return []
            return data['list']
        except Exception as e:
            print(f"Request failed for {url} with params {params}: {e}")
            return []

    # === Internal Helper Function: Fetches data for a single company ===
    def _get_salary_data_for_corp(corp_code):
        """Fetches and consolidates salary data for a single company."""
        endpoints = {
            'individual': 'https://opendart.fss.or.kr/api/hmvAuditIndvdlBySttus.json',
            'unregistered': 'https://opendart.fss.or.kr/api/unrstExctvMendngSttus.json',
            'grouped': 'https://opendart.fss.or.kr/api/drctrAdtAllMendngSttusMendngPymntamtTyCl.json' 
        }
        
        results = []

        # 1. Individual executives (개인별 보수)
        for row in _get_json(endpoints['individual'], corp_code):
            results.append({
                'corp_code': corp_code,
                'name': row.get('nm'),
                'position': row.get('ofcps'),
                'compensation': row.get('mendng_totamt'), 
                'salary_source': '개인별보수',
                'salary_type': 'exact'
            })

        # 2. Unregistered executives (미등기 임원)
        for row in _get_json(endpoints['unregistered'], corp_code):
            results.append({
                'corp_code': corp_code,
                'name': '',
                'position': row.get('se'),
                'compensation': row.get('jan_salary_am'), 
                'salary_source': '미등기임원',
                'salary_type': 'estimate'
            })

        # 3. Grouped executives (임원 전체 보수 유형)
        for row in _get_json(endpoints['grouped'], corp_code):
            results.append({
                'corp_code': corp_code,
                'name': '',
                'position': row.get('se'),
                'compensation': row.get('psn1_avrg_pymntamt'),
                'salary_source': '임원전체보수유형',
                'salary_type': 'estimate'
            })
            
        return pd.DataFrame(results)

    # === Main loop to process all companies ===
    all_salary_data = []

    for corp_code in kospi_company_info_df['corp_code'].apply(lambda c: str(c).zfill(8)):
        df = _get_salary_data_for_corp(corp_code)
        
        if not df.empty:
            all_salary_data.append(df)
        
        # Respect DART API rate limits
        time.sleep(0.07)

    # Concatenate all individual DataFrames into one
    final_df = pd.concat(all_salary_data, ignore_index=True)
    
    # Save the final DataFrame to a CSV file
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, f"salary_separate_{bsns_year}_{reprt_code}.csv")
    final_df.to_csv(output_path, index=False, encoding='utf-8-sig')
    
    print(f"Salary data for {len(all_salary_data)} companies saved to: {output_path}")

    return final_df

In [15]:
salary_separate_df = get_salary_type(kospi_company_info_df, bsns_year, reprt_code, API_key)

OPENDART Error for 00600013: 조회된 데이타가 없습니다.
OPENDART Error for 00600013: 조회된 데이타가 없습니다.
OPENDART Error for 00600013: 조회된 데이타가 없습니다.
OPENDART Error for 00435297: 조회된 데이타가 없습니다.
OPENDART Error for 00435297: 조회된 데이타가 없습니다.
OPENDART Error for 00435297: 조회된 데이타가 없습니다.
OPENDART Error for 00907013: 조회된 데이타가 없습니다.
OPENDART Error for 00907013: 조회된 데이타가 없습니다.
OPENDART Error for 00907013: 조회된 데이타가 없습니다.
OPENDART Error for 01880801: 조회된 데이타가 없습니다.
OPENDART Error for 01880801: 조회된 데이타가 없습니다.
OPENDART Error for 01880801: 조회된 데이타가 없습니다.
OPENDART Error for 01885222: 조회된 데이타가 없습니다.
OPENDART Error for 01885222: 조회된 데이타가 없습니다.
OPENDART Error for 01885222: 조회된 데이타가 없습니다.
OPENDART Error for 00182696: 조회된 데이타가 없습니다.
OPENDART Error for 00182696: 조회된 데이타가 없습니다.
OPENDART Error for 00182696: 조회된 데이타가 없습니다.
Salary data for 844 companies saved to: data\raw\salary_separate_2024_11011.csv


#### 6. get_salary_total  

pulls the total salary for each corp. 

[OPENDART | Guide for Developers to Remuneration for all directors and auditors](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE002&apiId=AE00014)

Required Keys:
- crtfc_key (API key)
- corp_code 
- bsns_year (fiscal year)
- reprt_code 

The resulting **salary_total_df** keeps all response variables. This includes: 
- nmpr (total headcount of all directors and auditors)
- mendng_totamt (total remuneration amount for all directors and auditors)
- jan_avrg_mendng_am (the average remuneration per person)

These values are used to check **summary_df**, ensuring that total headcount and remuneration totals align. The total remuneration amount is then merged with **summary_df** as a *Total Compensation* column. The function makes ~850 calls and executes in ~2 minutes.
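
One way such a headcount check could look is sketched below. It assumes one row per corp in the salary totals and a hypothetical `director_count` column in the summary data; neither assumption is confirmed by the notebook.

```python
import pandas as pd


def headcount_mismatches(summary: pd.DataFrame, salary_total: pd.DataFrame,
                         count_col: str = "director_count") -> pd.DataFrame:
    """Rows of summary whose headcount differs from the reported nmpr.

    Corps without a reported nmpr are also returned, since NaN never equals a count.
    """
    merged = summary.merge(salary_total[["corp_code", "nmpr"]], on="corp_code", how="left")
    # nmpr arrives as text and may contain thousands separators
    reported = pd.to_numeric(
        merged["nmpr"].astype(str).str.replace(",", "", regex=False), errors="coerce"
    )
    return merged[reported != merged[count_col]]


# Made-up data for two corps
summary = pd.DataFrame({"corp_code": ["00000001", "00000002"], "director_count": [4, 5]})
salary_total = pd.DataFrame({"corp_code": ["00000001", "00000002"], "nmpr": ["4", "6"]})
print(headcount_mismatches(summary, salary_total)["corp_code"].tolist())  # ['00000002']
```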

In [16]:
def get_salary_total(kospi_company_info_df: pd.DataFrame, bsns_year: str, reprt_code: str, API_key: str, output_dir: str = os.path.join('data', 'raw')) -> pd.DataFrame:
    url = "https://opendart.fss.or.kr/api/hmvAuditAllSttus.json"

    # === Internal Helper Function: DART API JSON request ===
    def _get_json(url, corp_code):
        params = {
            'crtfc_key': API_key,
            'corp_code': corp_code,
            'bsns_year': bsns_year,
            'reprt_code': reprt_code
        }
        
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
            
            if data.get('status') != '000' or 'list' not in data:
                if data.get('status') != '000':
                    print(f"DART API Error for {corp_code}: {data.get('message')}")
                return []
            return data['list']
    
        except Exception as e:
            print(f"Request failed for {url} with params {params}: {e}")
            return []
        
    salary_total = []
    
    for corp_code in kospi_company_info_df['corp_code'].apply(lambda c: str(c).zfill(8)):
        data_list = _get_json(url, corp_code)

        if data_list:
            df = pd.DataFrame(data_list)
            salary_total.append(df)
        
        time.sleep(0.07)
        
    # concatenate all individual DataFrames into one
    final_df = pd.concat(salary_total, ignore_index=True)
    
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, f"salary_total_data_{bsns_year}_{reprt_code}.csv")
    final_df.to_csv(output_path, index=False, encoding='utf-8-sig')
    
    print(f"Salary data for {len(salary_total)} companies saved to: {output_path}")

    return final_df

In [17]:
salary_total_df = get_salary_total(kospi_company_info_df, bsns_year, reprt_code, API_key)

DART API Error for 00600013: 조회된 데이타가 없습니다.
DART API Error for 00435297: 조회된 데이타가 없습니다.
DART API Error for 00907013: 조회된 데이타가 없습니다.
DART API Error for 01880801: 조회된 데이타가 없습니다.
DART API Error for 01885222: 조회된 데이타가 없습니다.
DART API Error for 00182696: 조회된 데이타가 없습니다.
Salary data for 844 companies saved to: data\raw\salary_total_data_2024_11011.csv


#### 7. get_major_shareholder_data

pulls holding status of major shareholders. 

In the preprocessing notebook, shareholder status is merged with the exec data, such that if a registered or unregistered executive is listed as a major shareholder, their shares are appended to exec_df.

[OPENDART | Guide for Developers to Information on largest shareholder](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE002&apiId=AE00008)

Required Keys:
- crtfc_key (API key)
- corp_code 
- bsns_year (fiscal year)
- reprt_code 

The resulting **major_shareholder_df** contains all response keys for potential further evaluation. When merged to **exec_df**, only *trmend_posesn_stock_qota_rt* (shareholding ratio at the end of the reporting period) is added as a *Shareholding Ratio* column. The following are not carried over: 
- bsis_posesn_stock_co	(number of stocks at the beginning of the reporting period)
- bsis_posesn_stock_qota_rt (shareholding ratio at the beginning of the reporting period)
- trmend_posesn_stock_co (number of stocks at the end of the reporting period)

Alternative Source: [OPENDART | Guide for Developers to Report of executives and major shareholders' ownership](https://engopendart.fss.or.kr/guide/detail.do?apiGrpCd=DE004&apiId=AE00041) which pulls stock transaction updates by executives and major shareholders
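
A minimal sketch of the merge described above follows. It assumes the shareholder response carries `corp_code` and `nm` columns and that an executive matches a shareholder when the name is identical within the same corp; `add_shareholding_ratio` is a hypothetical helper, and the actual preprocessing may match differently.

```python
import pandas as pd


def add_shareholding_ratio(exec_df: pd.DataFrame, shareholder_df: pd.DataFrame) -> pd.DataFrame:
    """Left-join the end-of-period shareholding ratio onto exec_df by corp and name."""
    ratios = (
        shareholder_df[["corp_code", "nm", "trmend_posesn_stock_qota_rt"]]
        .rename(columns={"nm": "name", "trmend_posesn_stock_qota_rt": "shareholding_ratio"})
        # a holder may appear once per stock type; keep the first row only
        .drop_duplicates(subset=["corp_code", "name"])
    )
    return exec_df.merge(ratios, on=["corp_code", "name"], how="left")


# Made-up data: one executive is also a major shareholder
execs = pd.DataFrame({"corp_code": ["00000001", "00000001"], "name": ["Kim", "Lee"]})
holders = pd.DataFrame({
    "corp_code": ["00000001"],
    "nm": ["Kim"],
    "trmend_posesn_stock_qota_rt": ["12.5"],
})
print(add_shareholding_ratio(execs, holders))
```

Executives who are not major shareholders keep a missing value in `shareholding_ratio`.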


In [18]:
def get_major_shareholder_data(api_key: str, kospi_codes_df: pd.DataFrame, bsns_year: str, reprt_code: str, output_dir: str = os.path.join('data', 'raw')) -> pd.DataFrame:
    results = []
    api_endpoint = "https://opendart.fss.or.kr/api/hyslrSttus.json"
    total_corps = len(kospi_codes_df)

    for idx, row in kospi_codes_df.iterrows():
        corp_code = row['corp_code']
        corp = str(corp_code).zfill(8)
        corp_name = row['corp_name']

        params = {
            'crtfc_key': api_key,
            'corp_code': corp,
            'bsns_year': bsns_year,
            'reprt_code': reprt_code
        }

        try:
            response = requests.get(api_endpoint, params=params)
            response.raise_for_status()
            data = response.json()

            if data['status'] == '000':
                if 'list' in data and data['list']:
                    df = pd.DataFrame(data['list'])
                    results.append(df)
            elif data['status'] == '013':
                print(f"No shareholder data available for {corp_name} ({corp_code}) for {bsns_year}/{reprt_code}.")
            else:
                print(f"API Error for {corp_name} ({corp_code}): Status {data.get('status')}, Message: {data.get('message')}")

        except Exception as e:
            print(f"An unexpected error occurred for {corp_name} ({corp_code}): {e}")

        time.sleep(0.07)

    if results:
        shareholder_df = pd.concat(results, ignore_index=True)
        output_filepath = os.path.join(output_dir, f'major_shareholders_{bsns_year}_{reprt_code}.csv')
        save_df_to_csv(shareholder_df, output_filepath)
        print(f"\nSuccessfully fetched and saved major shareholder data for {len(shareholder_df)} records.")
        return shareholder_df
    else:
        print("\nNo major shareholder data was retrieved.")
        return pd.DataFrame()

In [19]:
major_shareholder_df = get_major_shareholder_data(API_key, kospi_company_info_df, bsns_year, reprt_code) 

No shareholder data available for 미래에셋맵스 아시아퍼시픽 부동산공모 1호 투자회사 (00600013) for 2024/11011.
No shareholder data available for 맥쿼리한국인프라투융자회사 (00435297) for 2024/11011.
No shareholder data available for 한국투자ANKOR유전해외자원개발특별자산투자회사1호(지분증권) (00907013) for 2024/11011.
No shareholder data available for 케이비발해인프라투융자회사 (01880801) for 2024/11011.
No shareholder data available for 주식회사 대신밸류리츠위탁관리부동산투자회사 (01885222) for 2024/11011.
No shareholder data available for 대한조선 주식회사 (00182696) for 2024/11011.
DataFrame saved to data\raw\major_shareholders_2024_11011.csv

Successfully fetched and saved major shareholder data for 9102 records.
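
The `time.sleep(0.07)` pause in get_major_shareholder_data can be checked against the OPENDART limit of 1,000 calls per minute. The sketch below is only a rough sanity check: the helper names `min_delay_seconds` and `calls_per_minute_at` are hypothetical, and the calculation ignores the time each request itself takes, which only adds headroom.

```python
def min_delay_seconds(calls_per_minute: int) -> float:
    """Smallest fixed pause between calls that keeps the rate at or below the cap."""
    if calls_per_minute <= 0:
        raise ValueError("calls_per_minute must be positive")
    return 60 / calls_per_minute


def calls_per_minute_at(delay_seconds: float) -> float:
    """Upper bound on calls per minute when pausing delay_seconds between calls."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return 60 / delay_seconds


limit_delay = min_delay_seconds(1000)       # pause needed to sit exactly at the cap
rate_at_pause = calls_per_minute_at(0.07)   # rate implied by the 0.07 s pause
print(f"minimum pause: {limit_delay:.3f} s, rate at 0.07 s: {rate_at_pause:.0f} calls/min")
```

Under these assumptions a 0.07 s pause stays below the per-minute cap, so the daily limit of 20,000 calls is the constraint more likely to matter when several report types are pulled in one day.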


### **Data Extraction (data_extraction_ipynb)**

Groups and concatenates data into two dataframes: **exec_df** and **summary_df**. Unlike *data_extraction*, only the final cleaned and completed dataframes are saved to the processed data folder. 

#### 0. Build initial exec_df structure

From the raw data file **executive_status_data_df**, the following cell drops the columns identified in *data extraction* (3.) get_executive_data and renames the remaining OPENDART field names to readable column names. 

In [20]:
exec_df = executive_status_data_df.drop(
    columns=['corp_cls', 'birth_ym', 'fte_at', 'tenure_end_on', 'stlm_dt'], 
    errors='ignore'
).rename(
    columns={
        'rcept_no': 'disclosure',
        'nm': 'name',
        'sexdstn': 'gender',
        'ofcps': 'position',
        'rgist_exctv_at': 'exec_status',
        'chrg_job': 'responsibilities',
        'mxmm_shrholdr_relate': 'largest_shareholder_relate',
        'hffc_pd': 'employment_period',
        'trmend_posesn_stock_qota_rt': 'shareholding_ratio'
    }
)

#### **exec_df**
#### 1. Parse Experience and Build Initial Structure

separate_career takes the main_career column of **exec_df**, splits it on line breaks, and sorts each entry into education or work experience. Job keywords are checked first so that education-related positions (for example, a professor at a university) are not filed under education. The job keyword list only needs to contain these education-adjacent titles, because any entry that matches neither list defaults to work experience.

In [21]:
def separate_career(career_string):
    if pd.isna(career_string):
        return np.nan, np.nan

    education = []
    work_experience = []
    
    # prioritized keywords
    job_keywords = ['교수', '총장', '강사', '연구원', '학장', '팀장', '실장', '감사', '대표', '회장', '이사']
    edu_keywords = ['학사', '석사', '박사', '대학교', '법학', '대학원', '졸업', '수료', 'Univ.', 'School', 'College', 'MBA', 'U.', 'Institute', 'University']

    career_items = career_string.split('\n')
    
    for item in career_items:
        # check for job keywords for education-related backgrounds to avoid sorting as education 
        if any(keyword in item for keyword in job_keywords):
            work_experience.append(item.strip())
        # if no job keywords, check for educational keywords
        elif any(keyword in item for keyword in edu_keywords):
            education.append(item.strip())
        # default to work experience for other entries
        else:
            work_experience.append(item.strip())
            
    return (
        education if education else np.nan,
        work_experience if work_experience else np.nan
    )
exec_df[['education', 'work_exp']] = exec_df['main_career'].apply(
    lambda x: pd.Series(separate_career(x))
)

# drop the old career column and reorder the columns 
exec_df = exec_df.drop(columns=['main_career']).pipe(
    lambda df: df[['stock_code'] + [col for col in df.columns if col != 'stock_code']]
)
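
To illustrate how the keyword priority in separate_career plays out, here is a minimal sketch that classifies single entries. The helper `classify_career_item`, the shortened keyword lists, and the sample strings are all made up for illustration and are not taken from the data.

```python
def classify_career_item(item: str) -> str:
    """Return 'work' or 'education' for one career entry, checking job titles first."""
    job_keywords = ['교수', '이사', '대표']                      # illustrative subset
    edu_keywords = ['학사', '석사', '박사', '대학교', '졸업']      # illustrative subset

    if any(keyword in item for keyword in job_keywords):
        return 'work'
    if any(keyword in item for keyword in edu_keywords):
        return 'education'
    return 'work'  # default when neither list matches


samples = [
    '서울대학교 경영학과 졸업',   # degree entry, no job title
    '연세대학교 교수',           # university name plus a job title
    '삼성전자 재무팀',           # no keyword at all
]
labels = [classify_career_item(s) for s in samples]
print(labels)
```

The second sample is the case the ordering is meant to handle: it contains an education keyword (대학교), but the job title (교수) is checked first, so it is kept as work experience.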

#### 2. Individual Audit Committee Membership

The following code block extracts audit committee membership and auditor status from individual **exec_df** rows so that both can be counted correctly in **summary_df**.

In [22]:
def _clean_and_split(responsibility_string):
    if not isinstance(responsibility_string, str):
        return []
    
    # split the string by newlines, strip whitespace, and filter out empty strings
    return [item.strip() for item in responsibility_string.split('\n') if item.strip()]

exec_df['responsibilities'] = exec_df['responsibilities'].apply(_clean_and_split)

def is_audit_committee_member(responsibilities):
    if not responsibilities:
        return False
    # check if any item in the list matches the audit committee pattern
    for responsibility in responsibilities:
        responsibility_cleaned = re.sub(r'\s', '', responsibility)
        if re.search(r'감사위원회위원|감사위원|감사위원장', responsibility_cleaned):
            return True
    return False

def is_auditor_exclusive(responsibilities):
    if not responsibilities:
        return False
    # first, check if the person is an audit committee member and return False if they are
    if is_audit_committee_member(responsibilities):
        return False
    
    # otherwise, check for the isolated auditor pattern
    for responsibility in responsibilities:
        responsibility_cleaned = re.sub(r'\s', '', responsibility)
        if '감사' in responsibility_cleaned and not re.search(r'감사위원회위원|감사위원', responsibility_cleaned):
            return True
    return False

# 'responsibilities' was already split into lists above, so apply the checks directly
# (running _clean_and_split on a list again would return [] and mark every row False)
exec_df['is_audit_committee_member'] = exec_df['responsibilities'].apply(is_audit_committee_member)

exec_df['is_auditor'] = exec_df['responsibilities'].apply(is_auditor_exclusive)
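
As a compact illustration of the two checks above, the sketch below folds them into one hypothetical helper, `classify_audit_role`. It relies on plain substring matching after whitespace removal, as the notebook functions do, so any responsibility text that merely contains 감사 (for instance an internal audit team title) would also count as an auditor. The sample inputs are invented.

```python
import re


def classify_audit_role(responsibilities: list[str]) -> str:
    """Return 'committee_member', 'auditor', or 'neither' for a list of responsibilities."""
    cleaned = [re.sub(r'\s', '', r) for r in responsibilities]

    # '감사위원' also covers '감사위원회위원' and '감사위원장'
    if any('감사위원' in r for r in cleaned):
        return 'committee_member'
    # a bare '감사' without the committee wording is treated as a statutory auditor
    if any('감사' in r for r in cleaned):
        return 'auditor'
    return 'neither'


print(classify_audit_role(['감사위원회 위원']))
print(classify_audit_role(['상근감사']))
print(classify_audit_role(['경영총괄']))
```

Committee membership takes precedence, so a row is never counted as both a committee member and an auditor.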

#### 3. Assign Compensation

assign_compensation adds the exact or estimated reported salary to each executive, based on the executive's registered status and audit committee membership. 

It first looks for an exact match on corp_code and name in the individually reported salaries. If none exists, it falls back to the group-level estimate whose label matches the executive's status.

In [23]:
def assign_compensation(exec_df: pd.DataFrame, salary_separate_df: pd.DataFrame) -> pd.DataFrame:
    exec_df['salary'] = None
    exec_df['salary_source'] = None
    exec_df['salary_type'] = None

    for idx, row in exec_df.iterrows():
        corp_code = row['corp_code']
        name = row['name']
        status = row.get('exec_status')
        is_auditor = row.get('is_auditor', False)
        is_committee = row.get('is_audit_committee_member', False)

        # 1. Try to match by name
        match = salary_separate_df[
            (salary_separate_df['corp_code'] == corp_code) & 
            (salary_separate_df['name'] == name)
        ]

        if not match.empty:
            row_data = match.iloc[0]

        else:
            # 2. Estimate fallback: build the group label from status and audit role
            if status == '미등기':
                label = '미등기임원'
            elif is_auditor:
                label = '감사'
            elif status == '사외이사':
                label = '감사위원회 위원' if is_committee else '사외이사(감사위원회 위원 제외)'
            else:
                label = '등기이사(사외이사, 감사위원회 위원 제외)'

            group_match = salary_separate_df[
                (salary_separate_df['corp_code'] == corp_code) & 
                (salary_separate_df['name'].isna()) & 
                (salary_separate_df['position'] == label)
            ]

            row_data = group_match.iloc[0] if not group_match.empty else pd.Series(dtype='object')

        # 3. Assign if valid
        if not row_data.empty:
            exec_df.at[idx, 'salary'] = row_data.get('compensation')
            exec_df.at[idx, 'salary_source'] = row_data.get('salary_source')
            exec_df.at[idx, 'salary_type'] = row_data.get('salary_type')

    return exec_df

exec_df = assign_compensation(exec_df, salary_separate_df)
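
The fallback branch of assign_compensation can be read as a small decision rule. The sketch below isolates it in a hypothetical helper, `fallback_salary_label`, using the same labels as the cell above; it does not touch any dataframe.

```python
def fallback_salary_label(status: str, is_auditor: bool, is_committee: bool) -> str:
    """Pick the group-level salary label used when no name match exists."""
    if status == '미등기':
        return '미등기임원'
    if is_auditor:
        return '감사'
    if status == '사외이사':
        # outside directors are split by audit committee membership
        return '감사위원회 위원' if is_committee else '사외이사(감사위원회 위원 제외)'
    # every other registered director
    return '등기이사(사외이사, 감사위원회 위원 제외)'


print(fallback_salary_label('사외이사', is_auditor=False, is_committee=True))
print(fallback_salary_label('사내이사', is_auditor=False, is_committee=False))
```

Note the order of the checks: a non-registered executive is labeled 미등기임원 even if the auditor flag is set, and an inside director who sits on the audit committee still falls into the general registered-director group.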

#### 4. Merge Shareholder Data

Merges **exec_df** with **major_shareholder_df** on corp_code and name to add each executive's shareholding ratio. 

In [24]:
exec_df = pd.merge(exec_df, major_shareholder_df[['corp_code', 'nm', 'trmend_posesn_stock_qota_rt']], 
                   left_on=['corp_code', 'name'], right_on=['corp_code', 'nm'], how='left')
exec_df = exec_df.drop(columns=['nm']).rename(
    columns={'trmend_posesn_stock_qota_rt': 'shareholding_ratio'}
)

#### 5. Standardize Tenure

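convert_tenure_to_months converts the free-text employment_period values (start dates, decimal years, and strings such as "3년 6개월") into a number of months. Start dates are measured against a reference date. 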

In [25]:
def convert_tenure_to_months(tenure_str, current_date=None):
    if pd.isna(tenure_str) or not isinstance(tenure_str, str) or not tenure_str.strip():
        return pd.NA
        
    tenure_str = tenure_str.strip()
    
    # 1. all date formats, including those with extra text
    date_match = re.search(r'(\d{2,4}[년\.\s]\d{1,2}[월]?(?:[년\.\s]\d{1,2}[일])?|\d{1,2}\.\d{1,2}\.\d{1,2})', tenure_str)
    
    if date_match:
        date_str = date_match.group(1).replace(' ', '').replace('년', '.').replace('월', '').replace('일', '')
        date_obj = pd.NaT
        
        # parse the cleaned date string with multiple formats
        # year-first formats are tried before day-first, so '21.03.15' reads as 2021-03-15
        date_formats = ['%Y.%m.%d', '%Y.%m', '%y.%m.%d', '%y.%m', '%d.%m.%y']
        for fmt in date_formats:
            try:
                date_obj = pd.to_datetime(date_str, format=fmt, errors='raise')
                break # Exit the loop if parsing is successful
            except (ValueError, TypeError):
                continue
    
        if pd.notna(date_obj):
            if current_date is None:
                current_date = datetime.now()
            
            total_months = (current_date.year - date_obj.year) * 12 + (current_date.month - date_obj.month)
            return float(max(0, total_months))
    
    # 2. decimal years (e.g., '11.3년', '22.5')
    match_deci = re.search(r'^(\d+(?:\.\d+)?)(?:년)?$', tenure_str)
    if match_deci:
        decimal_years = float(match_deci.group(1))
        return decimal_years * 12
    
    # 3. years and months (e.g., "3년 6개월", "4년4개월")
    match_ym = re.search(r'(\d+)\s*년(?:[^\d]+)?\s*(\d+)\s*개월', tenure_str)
    if match_ym:
        years = int(match_ym.group(1))
        months = int(match_ym.group(2))
        return float(years * 12 + months)
        
    # 4. years only (e.g., "3년")
    match_y = re.search(r'^(\d+)\s*년$', tenure_str)
    if match_y:
        years = int(match_y.group(1))
        return float(years * 12)
        
    # 5. months only (e.g., "18개월")
    match_m = re.search(r'^(\d+)\s*개월$', tenure_str)
    if match_m:
        months = int(match_m.group(1))
        return float(months)
    
    return pd.NA

exec_df['employment_period'] = exec_df['employment_period'].apply(
    lambda x: convert_tenure_to_months(x, current_date=reference_date) 
)
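
Two of the conversions in convert_tenure_to_months are simple arithmetic and can be checked by hand. The sketch below isolates them using only the standard library; the helper names `months_between` and `years_months_to_months` and the sample dates are hypothetical.

```python
import re
from datetime import date


def months_between(start: date, reference: date) -> int:
    """Whole calendar months from start to reference, ignoring the day of month."""
    total = (reference.year - start.year) * 12 + (reference.month - start.month)
    return max(0, total)  # a start date after the reference date counts as 0


def years_months_to_months(text: str):
    """Convert strings like '3년 6개월' to months, or return None if the pattern is absent."""
    match = re.search(r'(\d+)\s*년\s*(\d+)\s*개월', text)
    if not match:
        return None
    return int(match.group(1)) * 12 + int(match.group(2))


print(months_between(date(2021, 3, 1), date(2024, 12, 31)))  # 3 years * 12 + 9 months
print(years_months_to_months('3년 6개월'))
print(years_months_to_months('4년4개월'))
```

Because only year and month enter the calculation, a start date of 31 March and one of 1 March give the same tenure.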

#### **summary_df**

#### 1. Build Initial summary_df

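Groups **exec_df** by corporation and summarizes board composition (audit committee size, director types, and gender of voting directors), then merges in total assets and the most recent disclosure number.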

In [None]:
def extract_summary_optimized(group):
    voting_directors_group = group[~group['exec_status'].isin(['미등기', '감사'])]
    female_voting = (voting_directors_group['gender'] == '여').sum()
    male_voting = (voting_directors_group['gender'] == '남').sum()

    return pd.Series({
        'audit_committee': group['is_audit_committee_member'].sum(),
        'audit_committee_ods': ((group['is_audit_committee_member']) & (group['exec_status'] == '사외이사')).sum(),
        'inside_directors': group['exec_status'].isin(['사내이사', '대표집행임원']).sum(),
        'outside_directors': (group['exec_status'] == '사외이사').sum(),
        'female_voting': female_voting,
        'male_voting': male_voting,
        'voting_directors': female_voting + male_voting,
        'other_non_exec_directors': (group['exec_status'] == '기타비상무이사').sum(),
        'auditors': group['is_auditor'].sum(),
        'non_registered': (group['exec_status'] == '미등기').sum()
    })

summary_df = exec_df.groupby(['stock_code', 'corp_code', 'corp_name'], as_index=False).apply(
    extract_summary_optimized, include_groups=False
)

summary_df = pd.merge(
    summary_df,
    assets_df[['corp_code', '2024_total_assets', '2023_total_assets', '2022_total_assets']],
    on='corp_code',
    how='left'
)

disclosure = exec_df.groupby('corp_code')['disclosure'].max().reset_index()
summary_df = pd.merge(
    summary_df,
    disclosure,
    on='corp_code',
    how='left'
)
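
The per-corporation counts can be illustrated without pandas. The sketch below applies the same rule for voting directors (everyone except 미등기 and 감사) to a small invented list of records; `summarize_board` and the sample rows are hypothetical and cover only a few of the summary columns.

```python
from collections import Counter

NON_VOTING_STATUSES = {'미등기', '감사'}


def summarize_board(execs: list[dict]) -> dict:
    """Count director types and voting directors by gender for one corporation."""
    status_counts = Counter(e['exec_status'] for e in execs)
    voting = [e for e in execs if e['exec_status'] not in NON_VOTING_STATUSES]
    female_voting = sum(e['gender'] == '여' for e in voting)
    male_voting = sum(e['gender'] == '남' for e in voting)
    return {
        'outside_directors': status_counts['사외이사'],
        'non_registered': status_counts['미등기'],
        'female_voting': female_voting,
        'male_voting': male_voting,
        'voting_directors': female_voting + male_voting,
    }


sample_execs = [
    {'exec_status': '사내이사', 'gender': '남'},
    {'exec_status': '사외이사', 'gender': '여'},
    {'exec_status': '감사', 'gender': '남'},
    {'exec_status': '미등기', 'gender': '여'},
]
summary = summarize_board(sample_execs)
print(summary)
```

As in the notebook cell, voting_directors is the sum of the two gender counts, so a row with a missing or unexpected gender value would be left out of the total.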

#### 2. Audit Committee Checks

For **summary_df**, each corp goes through two rounds of governance checks (*audit_committee_compliance*). A corp flagged on the first pass may reflect a real discrepancy, or it may reflect incomplete data in the **exec_df** pulled from OPENDART (for example, an outside committee member who holds an active position but is not listed as an executive). To tell the two apart, the flagged corps' filed reports are parsed directly, missing data is corrected, and the check is run again. 

**missing_acm_urls(flagged_df)** pulls the URL of the audit committee section using OpenDartReader's sub_docs function. **parse_and_update_audit_members(audit_targets_df, exec_df, summary_df)** then parses that page. If the audit committee size and membership in **summary_df** do not match what the report lists, the function updates **summary_df**. On the second pass, corps that still fail the governance checks are flagged, alongside their failed conditions. 

##### 2A. audit_committee_compliance

Runs checks on **summary_df**. Checks for the following: 

1. Mandated Audit Committee: If a corporation's total assets exceed 2T KRW, it must have an audit committee with at least 3 members. As corporations have a 2-year grace period, the function uses the total asset value from 2 years prior. If a corporation has no reported total assets for that year, it falls back on the following year. 
2. Outside Two-Thirds: If a mandated audit committee exists, outside directors must make up two-thirds of its members. 
3. Outside Director Minimum: The corporation must have at least 3 outside directors in total. 
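
The checks listed above can be sketched as plain functions on one corporation's numbers. This is a minimal sketch, not the notebook's implementation: `reference_assets` and `governance_flags` are hypothetical names, and the sketch treats a committee with exactly two-thirds outside directors as passing, which is an assumption about the intended boundary.

```python
THRESHOLD_KRW = 2_000_000_000_000  # 2T KRW


def reference_assets(assets_by_year: dict, check_year: int):
    """Total assets from 2 years prior, falling back to later years if missing."""
    for year in (check_year - 2, check_year - 1, check_year):
        value = assets_by_year.get(year)
        if value is not None:
            return value
    return None


def governance_flags(total_assets, committee_size, committee_outside, outside_directors):
    """Return the failed checks for one corporation (empty list if none apply)."""
    if total_assets is None or total_assets <= THRESHOLD_KRW:
        return []  # checks only apply to large corporations
    flags = []
    if committee_size < 3:
        flags.append('audit_committee_fail')
    # integer form of committee_outside / committee_size < 2/3 (avoids float rounding)
    if 3 * committee_outside < 2 * committee_size:
        flags.append('committee_majority_fail')
    if outside_directors < 3:
        flags.append('outside_minimum_fail')
    return flags


assets = reference_assets({2022: None, 2023: 3_000_000_000_000}, check_year=2024)
print(governance_flags(assets, committee_size=3, committee_outside=2, outside_directors=4))
print(governance_flags(assets, committee_size=2, committee_outside=1, outside_directors=2))
```

Keeping the asset lookup separate from the rule checks makes it easier to test the grace-period fallback on its own.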

In [76]:
def audit_committee_compliance(df):
    
    # 1. prioritize total assets from 2 years ago, otherwise use the first available to identify large corps
    df['total_assets'] = df['2022_total_assets'].fillna(df['2023_total_assets']).fillna(df['2024_total_assets'])
    is_large_company = df['total_assets'] > 2_000_000_000_000
    
    large_corps = df.loc[is_large_company].copy()
    
    # 2. check compliance rules for large corps 

    # flag 1: if there is no audit committee, or if the audit committee has less than 3 members 
    large_corps['audit_committee_fail'] = (large_corps['audit_committee'] < 3)
    
    # flag 2: outside directors make up less than 2/3 of the audit committee
    # (integer comparison avoids float rounding and matches the "< 2/3" failure message)
    large_corps['committee_majority_fail'] = (
        3 * large_corps['audit_committee_ods'] < 2 * large_corps['audit_committee']
    )
    
    # flag 3: there are less than 3 outside directors in total
    large_corps['outside_minimum_fail'] = (
        large_corps['outside_directors'] < 3
    )
    
    def get_failure_messages(row):
        messages = []
        if pd.notna(row['audit_committee_fail']) and row['audit_committee_fail']:
            messages.append(f"Audit Committee has fewer than 3 members ({row['audit_committee']}) or none listed.")
        if pd.notna(row['committee_majority_fail']) and row['committee_majority_fail']:
            messages.append(f"Audit Committee Outside Directors ({row['audit_committee_ods']}) < 2/3 of Audit Committee ({row['audit_committee']}).")
        if pd.notna(row['outside_minimum_fail']) and row['outside_minimum_fail']:
            messages.append(f"Fewer than 3 Outside Directors ({row['outside_directors']}).")
        return "; ".join(messages)

    large_corps['flags'] = large_corps.apply(get_failure_messages, axis=1)
    
    return large_corps.loc[large_corps['flags'] != ''].drop(columns=[
        'total_assets', 'audit_committee_fail', 'committee_majority_fail',
        'outside_minimum_fail'
    ])

In [77]:
flagged = audit_committee_compliance(summary_df) 

In [79]:
def missing_acm_urls(flagged_df):
    results = []

    for idx, row in flagged_df.iterrows():
        company = row['corp_name']
        corp = row['corp_code']
        rcp = row['disclosure']  # most recent disclosure number (rcept_no) merged in from exec_df

        try:
            subdocs = dart.sub_docs(str(rcp))  # rcept_no must be string
            match = subdocs[subdocs['title'].str.contains("감사제도에 관한 사항")]

            if not match.empty:
                url = match.iloc[0]['url']
            else:
                url = None

        except Exception as e:
            print(f"Failed to fetch for {corp} ({rcp}): {e}")
            url = None

        results.append({
            'corp_code': str(corp),
            'corp_name': company,
            'rcept_no': rcp,
            'url': url,
        })
        
        time.sleep(0.7)

    return pd.DataFrame(results)

In [80]:
audit_targets_df = missing_acm_urls(flagged).drop_duplicates(subset=['corp_code'])

In [82]:
def parse_and_update_audit_members(audit_targets_df, exec_df, summary_df):
    new_execs_to_add = []
    summary_updates = {}

    for idx, row in audit_targets_df.iterrows():
        corp_code = str(row['corp_code'])
        company = row['corp_name']
        url = row['url']

        if pd.isna(url) or not isinstance(url, str):
            continue

        try:
            response = requests.get(url, timeout=20)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')

            members_found = []
            auditors_found = []
            
            # 1. find a table under a specific title
            ac_header_patterns = [
                re.compile(r'감사위원\s*현황'),
                re.compile(r'감사위원회\s*위원의\s*인적사항'),
                re.compile(r'감사위원회\s*위원'),
                re.compile(r'감사기구\s*관련\s*사항')
            ]
            
            found_ac_section = None
            ac_table = None
            for pattern in ac_header_patterns:
                found_ac_section = soup.find(string=pattern)
                if found_ac_section:
                    # find the first table under the header 
                    ac_table = found_ac_section.find_next('table')
                    if ac_table:
                        break
            
            if ac_table:
                # parse the target table for columns indicating name and whether or not the listed exec is an outside director
                headers = [th.get_text(strip=True).replace('\xa0', '').replace('\n', '') for tr in ac_table.find_all('tr', limit=2) for th in tr.find_all(['th', 'td'])]
                name_idx = next((i for i, h in enumerate(headers) if '성명' in h), None)
                outside_idx = next((i for i, h in enumerate(headers) if '사외이사' in h), None)
                
                if name_idx is not None and outside_idx is not None:
                    # parse the rows of the table
                    data_rows = ac_table.find_all('tbody')[0].find_all('tr') if ac_table.find('tbody') else ac_table.find_all('tr')[len(ac_table.find_all('tr', limit=2)):]
                    for tr in data_rows:
                        tds = tr.find_all(['td', 'th'])
                        if len(tds) > max(name_idx, outside_idx):
                            name = tds[name_idx].get_text(strip=True)
                            is_outside = tds[outside_idx].get_text(strip=True)
                            if name and name != '-' and '---' not in name:
                                is_outside_flag = '예' in is_outside or 'O' in is_outside
                                members_found.append({'name': name, 'is_outside': is_outside_flag})
      
                                        
            # make updates based on extracted info 
            if members_found:
                total_members = len(members_found)
                outside_members = sum(1 for member in members_found if member['is_outside'])
                
                for member in members_found:
                    name = member['name']
                    existing_mask = (exec_df['corp_code'] == corp_code) & (exec_df['name'] == name)
                    if not exec_df[existing_mask].empty:
                        exec_df.loc[existing_mask, 'is_audit_committee_member'] = True
                    else:
                        new_execs_to_add.append({
                            'corp_code': corp_code, 'corp_name': company, 'name': name,
                            # chrg_job was renamed to 'responsibilities' in exec_df (a list per row)
                            'responsibilities': ['감사위원회 위원'], 'is_audit_committee_member': True
                        })
                
                summary_updates[corp_code] = {
                    'audit_committee': total_members, 'audit_committee_ods': outside_members,
                }
            else:
                summary_updates[corp_code] = {
                    'audit_committee': 0, 'audit_committee_ods': 0,
                }

        except Exception as e:
            print(f"Exception occurred for {corp_code} - {company}: {e}")
        time.sleep(0.7)
    
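    for corp_code, update in summary_updates.items():
        summary_df.loc[summary_df['corp_code'] == corp_code, 'audit_committee'] = update['audit_committee']
        summary_df.loc[summary_df['corp_code'] == corp_code, 'audit_committee_ods'] = update['audit_committee_ods']

    # append committee members found in the report but missing from exec_df
    if new_execs_to_add:
        exec_df = pd.concat([exec_df, pd.DataFrame(new_execs_to_add)], ignore_index=True)

    return exec_df, summary_df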

In [83]:
exec_df, summary_df = parse_and_update_audit_members(
    audit_targets_df,
    exec_df,
    summary_df
)

updated_flags = audit_committee_compliance(summary_df)

In [None]:
output_folder = os.path.join('..', 'data', 'processed')
exec_file_path = os.path.join(output_folder, 'exec_df.csv')
summary_file_path = os.path.join(output_folder, 'summary_df.csv')

os.makedirs(output_folder, exist_ok=True)

exec_df.to_csv(exec_file_path, index=False, encoding='utf-8-sig')
summary_df.to_csv(summary_file_path, index=False, encoding='utf-8-sig')
