### Segmentation

1. Single Request 
    
    Demo to test changes to one search. 

2. Page Navigation Search ~ 16 minutes

    Goes through only the existing reports (i.e. the entries that show when '기업지배구조 보고서 공시' is inputted)

    Still Faulty: 
    - in switching to the most recent/revised document on open (tried sending click to dropdown but this never updated)
    - in detecting last page of the disclosure search viewer (when it hits page 38, it doesn't break out of the loop)

    Potential Fix: 
    - Incorporate logic that checks the '번호' (row number) value and stops once it hits 1, since the entries are listed in descending order by date.
    - OR check, before each report link is opened, whether the company already exists as a key (i.e. a more recent report was already scraped); if so, skip it. This handles both issues as long as the check uses the company, not the submission, as the key.

3. KOSPI Search ~ 31 minutes

    Passes in the full list of kospi codes extracted from OpenDART. 

    Pro: Since each search is ordered from most recent, the function opens the first (most updated) disclosure without going through extra documents.

    Con: Not every listed company has published a report, but the function still passes in every code, which results in a longer runtime.
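
The two potential fixes listed for the Page Navigation Search can be sketched without a browser. This is a minimal illustration, not the notebook's implementation: `select_rows_to_scrape` and the `(번호, company)` tuples are hypothetical, and it assumes the '번호' value and company name of each row have already been read from the results table.

```python
def select_rows_to_scrape(rows, scraped=None):
    """Pick the rows worth opening, given rows in descending date order.

    rows: iterable of (number, company) tuples, where number is the '번호' value.
    scraped: set of companies already scraped (hypothetical bookkeeping).
    """
    scraped = set() if scraped is None else scraped
    selected = []
    for number, company in rows:
        if company not in scraped:
            # The first row seen for a company is its most recent report.
            scraped.add(company)
            selected.append((number, company))
        if number == 1:
            # '번호' counts down to 1, so this is the final row of the final page.
            break
    return selected


# Made-up rows: company "A" appears twice, and the row after 번호 1 must be ignored.
rows = [(4, "A"), (3, "B"), (2, "A"), (1, "C"), (99, "D")]
picked = select_rows_to_scrape(rows)
print(picked)
```

Keying the `scraped` set on the company rather than on the individual submission is what lets the same check skip older, superseded reports.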

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC

from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd
import pickle
import time
import json

In [2]:
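# read stock codes as strings so leading zeros (e.g. '014830') are preserved
kospi_df = pd.read_csv('kospi_company_info.csv', dtype={'stock_code': str})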
kospi_codes = kospi_df['stock_code']

### Singular Search

In [None]:
code = '014830'

In [3]:
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 15)
driver.get('https://kind.krx.co.kr/disclosure/searchdisclosurebycorp.do?method=searchDisclosureByCorpMain')

original_window = driver.current_window_handle

In [15]:
time.sleep(.2)
report_element = wait.until(EC.presence_of_element_located((By.ID, 'AKCKwd')))
report_element.clear()
report_element.send_keys(code)

# input the start date
time.sleep(.2)
date_element = wait.until(EC.element_to_be_clickable((By.ID, 'fromDate')))
date_element.clear()
date_element.send_keys('2025-01-01')
time.sleep(.2)

# input the target document title (Disclosure of Corporate Governance Report)
report_element = wait.until(EC.presence_of_element_located((By.ID, 'reportNmTemp')))
report_element.send_keys('기업지배구조 보고서 공시')

# click on the search button
search_element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.btn-sprite.search-btn')))
search_element.click()

# wait for the search results to load
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'tbody tr')))

# extract the table from the first search result, which will be the most recent report
disclosure_link = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), '기업지배구조 보고서 공시')]")))
disclosure_link.click()

In [None]:
all_window_handles = driver.window_handles
for handle in all_window_handles:
    if handle != original_window:
        driver.switch_to.window(handle)
        break

iframe = wait.until(EC.presence_of_element_located((By.ID, "docViewFrm")))
driver.switch_to.frame(iframe)

css_selector = 'td.single-textbox.bg_percent'

try:
    element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css_selector)))
    report_value = element.text

except Exception as e:
    print(f"Error: Element not found or not present. {e}")

try:
    css_selector = 'table-group[aclass="krx-cg_VotingResultsOfTheGeneralMeetingOfShareholdersAbstract"] table.fact-table'
    fact_table_element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )

    table_html_string = fact_table_element.get_attribute('outerHTML')

    # Use BeautifulSoup to parse for headers.
    soup = BeautifulSoup(table_html_string, 'html.parser')
    scraped_headers = [th.get_text(strip=True) for th in soup.find_all('th')]

    # Use pandas to read the table; wrap the HTML in StringIO because passing a literal HTML string is deprecated.
    dfs = pd.read_html(StringIO(table_html_string), header=None)

    if dfs:
        df = dfs[0]

        # Define final headers with the first two columns.
        final_headers = ['총회', '의안'] + scraped_headers[1:]
        
        # Clean the DataFrame to match the number of headers.
        df = df.iloc[1:]
        df.reset_index(drop=True, inplace=True)
        
        # Rename columns with the new headers.
        if len(final_headers) == len(df.columns):
            df.columns = final_headers
            print("Successfully extracted and processed table.")
        else:
            print(f"Error: The number of columns ({len(df.columns)}) does not match the number of headers ({len(final_headers)}).")
    else:
        print("No tables found in the HTML.")
        
except Exception as e:
    print(f"Error extracting table: {e}")
    
finally:
    # Switch back to the main document from the iframe, regardless of success or failure.
    driver.switch_to.default_content()

In [17]:
df

Unnamed: 0,총회,의안,결의 구분,회의 목적사항,가결 여부,의결권 있는 발행주식 총수(1),(1) 중 의결권 행사 주식수,찬성주식수,찬성 주식 비율 (%),반대 기권 등 주식수,반대 기권 등 주식 비율 (%)
0,제38기 정기 주주총회,제2-1호 의안,특별(Extraordinary),이사의 인원수 명확화,가결(Approved),107856043,91736706,91628348,99.9,108358.0,0.1
1,제38기 정기 주주총회,제2-2호 의안,특별(Extraordinary),감사위원 선임 관련 조문 정비,가결(Approved),107856043,91736706,77909093,84.9,13827613.0,15.1
2,제38기 정기 주주총회,제2-3호 의안,특별(Extraordinary),대표이사 사장 선임 방법 명확화,가결(Approved),107856043,74689281,53946867,72.2,20742414.0,27.8
3,제38기 정기 주주총회,제2-4호 의안,특별(Extraordinary),분기배당기준일 변경,가결(Approved),107856043,91736706,91626520,99.9,110186.0,0.1
4,제38기 정기 주주총회,제3호 의안,보통(Ordinary),사내이사 이상학 선임의 건,가결(Approved),107856043,91736706,90539809,98.7,1196897.0,1.3
5,제38기 정기 주주총회,제4-1호 의안,보통(Ordinary),사외이사 손관수 선임의 건,가결(Approved),107856043,91736706,89269740,97.3,2466966.0,2.7
6,제38기 정기 주주총회,제4-2호 의안,보통(Ordinary),사외이사 이지희 선임의 건,가결(Approved),107856043,91736706,90424942,98.6,1311764.0,1.4
7,제38기 정기 주주총회,제5호 의안,보통(Ordinary),감사위원회 위원 손관수 선임의 건,가결(Approved),107856043,74689281,73254834,98.1,1434447.0,1.9
8,제38기 정기 주주총회,제6호 의안,보통(Ordinary),이사 보수한도 승인의 건,가결(Approved),107856043,87807386,87469388,99.6,337998.0,0.4
9,제37기 정기 주주총회,제1호 의안,보통(Ordinary),제37기 재무제표 및 이익잉여금처분계산서 승인의 건,가결(Approved),112809923,87368552,83585189,95.7,3783363.0,4.3


In [18]:
ranked_voting = df[df['반대 기권 등 주식수'].isna()]
majority_voting = df[df['반대 기권 등 주식수'].notna()]

In [None]:
majority_voting

In [20]:
ranked_voting

Unnamed: 0,총회,의안,결의 구분,회의 목적사항,가결 여부,의결권 있는 발행주식 총수(1),(1) 중 의결권 행사 주식수,찬성주식수,찬성 주식 비율 (%),반대 기권 등 주식수,반대 기권 등 주식 비율 (%)
16,제37기 정기 주주총회,제3-1호 의안,보통(Ordinary),대표이사 사장 방경만 선임의 건 (KT&G 이사회 안),가결(Approved),112809923,165207264,84097688,50.9,,0.0
17,제37기 정기 주주총회,제3-2호 의안,보통(Ordinary),사외이사 임민규 선임의 건 (KT&G 이사회 안),부결(Not approved),112809923,165207264,24505618,14.8,,0.0
18,제37기 정기 주주총회,제3-3호 의안,보통(Ordinary),사외이사 손동환 선임의 건 (주주제안_중소기업은행),가결(Approved),112809923,165207264,56603958,34.3,,0.0


### Page Navigation Search

In [13]:
def scrape_all_reports(driver):
    all_reports_data = []
    # max delay wait for elements to load in 
    wait = WebDriverWait(driver, 3)
    
    # main search page 
    driver.get('https://kind.krx.co.kr/disclosure/searchdisclosurebycorp.do?method=searchDisclosureByCorpMain')
    original_window = driver.current_window_handle
    
    # set initial parameters (date, report name)
    try:
        time.sleep(.2)
        date_element = wait.until(EC.element_to_be_clickable((By.ID, 'fromDate')))
        date_element.clear()
        date_element.send_keys('2025-01-01')
        time.sleep(.2)
        
        report_element = wait.until(EC.presence_of_element_located((By.ID, 'reportNmTemp')))
        report_element.send_keys('기업지배구조 보고서 공시')
        
        search_element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.btn-sprite.search-btn')))
        search_element.click()
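    except Exception as e:
        # report why the search form could not be submitted instead of failing silently
        print(f"Could not submit the search form: {e}")
        return all_reports_data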

    # search loop to parse through each listed report 
    while True:
        try:
            print("Scraping current page...")
            
            # find all report links on the current page 
            # do this every time it navigates to a new page because doing so makes the previous elements 'stale'
            page_entries = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//a[contains(text(), '기업지배구조 보고서 공시')]")))
            
            for entry_index, _ in enumerate(page_entries):
                try:
                    # search the element again to avoid StaleElementReferenceException
                    current_entries = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//a[contains(text(), '기업지배구조 보고서 공시')]")))
                    entry_to_click = current_entries[entry_index]
                
                    old_window_handles = set(driver.window_handles)
                    entry_to_click.click()
                    
                    # wait for the new window to appear, then switch 
                    wait.until(EC.new_window_is_opened(old_window_handles))
                    all_window_handles = driver.window_handles
                    new_window_handle = [handle for handle in all_window_handles if handle not in old_window_handles]

                    if new_window_handle:
                        driver.switch_to.window(new_window_handle[0])
                        
                        iframe = wait.until(EC.presence_of_element_located((By.ID, "docViewFrm")))
                        driver.switch_to.frame(iframe)

                        report_value, df = None, None
                        try:
                            # scrape the report value 
                            report_value_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'td.single-textbox.bg_percent')))
                            report_value = report_value_element.text
                        except Exception as e:
                            print("Report value not found.")
                            pass

                        try:
                            # scrape the voting results table
                            css_selector = 'table-group[aclass="krx-cg_VotingResultsOfTheGeneralMeetingOfShareholdersAbstract"] table.fact-table'
                            fact_table_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css_selector)))
                            table_html_string = fact_table_element.get_attribute('outerHTML')

                            # convert to StringIO to avoid error 
                            dfs = pd.read_html(StringIO(table_html_string))
                            if dfs:
                                df = dfs[0]
                        except Exception as e:
                            print("Fact table element not found.")
                            pass

                        all_reports_data.append({'report_value': report_value, 'df': df})
                        
                        driver.close()
                        driver.switch_to.window(original_window)
                    
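                except Exception as e:
                    # report the failure, close any stray report window, and return to the results page
                    print(f"Skipping entry {entry_index}: {e}")
                    for handle in driver.window_handles:
                        if handle != original_window:
                            driver.switch_to.window(handle)
                            driver.close()
                    driver.switch_to.window(original_window)
                    continue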

            next_page_link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.next")))
            next_page_link.click()
            print("Next page.")
            
        except Exception as e:
            break

    return all_reports_data

if __name__ == '__main__':
    try:
        driver = webdriver.Chrome()
        nav_search_governance_data = scrape_all_reports(driver)
        print(f"Total reports found: {len(nav_search_governance_data)}")
    except Exception as e:
        print(f"An error occurred during script execution: {e}")
    finally:
        if 'driver' in locals() and driver:
            driver.quit()

Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Next page.
Scraping current page...
Nex
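
One way to keep the window state consistent when a report fails partway through is to wrap the visit in a context manager. The sketch below is an assumption layered on top of the notebook, not part of it: `report_window` is a hypothetical helper that relies only on `driver.switch_to.window` and `driver.close`, which the scraping loop above already uses.

```python
from contextlib import contextmanager


@contextmanager
def report_window(driver, original_window, new_window):
    """Switch to new_window, then always close it and return to original_window."""
    driver.switch_to.window(new_window)
    try:
        yield driver
    finally:
        # Runs on success and on error, so a failed scrape cannot leave
        # the driver pointed at a stray report window.
        driver.close()
        driver.switch_to.window(original_window)


# Intended use inside the loop (names taken from scrape_all_reports):
# with report_window(driver, original_window, new_window_handle[0]):
#     iframe = wait.until(EC.presence_of_element_located((By.ID, "docViewFrm")))
#     driver.switch_to.frame(iframe)
#     ...
```

Because the cleanup lives in `finally`, the per-report `try`/`except` blocks only need to handle parsing problems, not window bookkeeping.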

In [None]:
# save the list of result dicts with pickle (serializes the nested DataFrames)

with open('nav_search_governance_data.pkl', 'wb') as f:
    pickle.dump(nav_search_governance_data, f)

In [None]:
# to read in 

with open('nav_search_governance_data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

### KOSPI Search

In [41]:
def scrape_governance_data(driver, codes_list):
    results_dict = {}
    wait = WebDriverWait(driver, 1)
    
    for code in codes_list:    
        driver.get('https://kind.krx.co.kr/disclosure/searchdisclosurebycorp.do?method=searchDisclosureByCorpMain')
        original_window = driver.current_window_handle
        try:            
            # using JavaScript to set the stock code directly
            report_element = wait.until(EC.presence_of_element_located((By.ID, 'AKCKwd')))
            driver.execute_script("arguments[0].value = arguments[1];", report_element, code)
            
            time.sleep(.2)

            # send start date 
            date_element = wait.until(EC.element_to_be_clickable((By.ID, 'fromDate')))
            date_element.clear()
            date_element.send_keys('2025-01-01')
                        
            time.sleep(.2)

            # send report name request 
            report_element = wait.until(EC.presence_of_element_located((By.ID, 'reportNmTemp')))
            report_element.send_keys('기업지배구조 보고서 공시')
    
            # click on the search button 
            search_element = driver.find_element(By.CSS_SELECTOR, 'a.btn-sprite.search-btn')
            search_element.click()
            
            time.sleep(.2)
            try:
                disclosure_link = driver.find_element(By.XPATH, "//tbody/tr[1]//a[contains(text(), '기업지배구조 보고서 공시')]")
                disclosure_link.click()
            except Exception as e:
                print(f"No search results or link found for code {code}. Skipping...")
                continue 

            # wait for the new window to appear before moving driver 
            wait.until(EC.number_of_windows_to_be(2))
            all_window_handles = driver.window_handles
            new_window_handle = [handle for handle in all_window_handles if handle != original_window]

            if new_window_handle:
                driver.switch_to.window(new_window_handle[0])
                print(f'Processing code {code}, new window: {driver.title}')
                
                # switch to iframe (embedded html element)
                iframe = wait.until(EC.presence_of_element_located((By.ID, "docViewFrm")))
                driver.switch_to.frame(iframe)

                # initialize report value and df 
                report_value = None
                df = None
                
                try:
                    report_value_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'td.single-textbox.bg_percent')))
                    report_value = report_value_element.text
                except Exception as e:
                    print(f"Warning: Could not find report value for code {code}. Error: {e}")

                # locate voting results table
                try:
                    css_selector = 'table-group[aclass="krx-cg_VotingResultsOfTheGeneralMeetingOfShareholdersAbstract"] table.fact-table'
                    fact_table_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css_selector)))
                    table_html_string = fact_table_element.get_attribute('outerHTML')
                    
                    soup = BeautifulSoup(table_html_string, 'html.parser')
                    scraped_headers = [th.get_text(strip=True) for th in soup.find_all('th')]
                    
                    dfs = pd.read_html(StringIO(table_html_string), header=None)
                    if dfs:
                        df = dfs[0]
                        final_headers = ['총회', '의안'] + scraped_headers[1:]
                        df = df.iloc[1:]
                        df.reset_index(drop=True, inplace=True)
                        
                        if len(final_headers) == len(df.columns):
                            df.columns = final_headers
                            print(f"Successfully extracted table for code {code}.")
                        else:
                            print(f"Column/header mismatch for code {code}.")
                            df = None
                    else:
                        print(f"No tables found for code {code}.")
                except Exception as e:
                    print(f"Error extracting table for code {code}. Error: {e}")
                    df = None
                
                results_dict[code] = {'report_value': report_value, 'df': df}
                
                # close the new window and switch back to the original
                driver.close()
                driver.switch_to.window(original_window)
            else:
                print(f"No new window opened for code {code}.")
        
        except Exception as e:
            # A site alert is one common cause of failure: dismiss it if present,
            # otherwise report the original error.
            try:
                alert = driver.switch_to.alert
                print(f"Alert on page for code {code}. Text: {alert.text}")
                alert.accept()
                print("Alert dismissed. Skipping to next code.")
            except Exception:
                print(f"An error occurred while processing code {code}: {e}")


    return results_dict

In [42]:
if __name__ == '__main__':
    driver = webdriver.Chrome()
    stock_codes_to_scrape = kospi_codes
    governance_data = scrape_governance_data(driver, stock_codes_to_scrape)
    driver.quit()

No search results or link found for code 094800. Skipping...
No search results or link found for code 088980. Skipping...
No search results or link found for code 105840. Skipping...
Processing code 000490, new window: [대동] 기업지배구조 보고서 공시
Successfully extracted table for code 000490.
No search results or link found for code 001820. Skipping...
No search results or link found for code 000910. Skipping...
No search results or link found for code 049800. Skipping...
Processing code 200880, new window: [서연이화] 기업지배구조 보고서 공시
Successfully extracted table for code 200880.
No search results or link found for code 001070. Skipping...
No search results or link found for code 011420. Skipping...
No search results or link found for code 000700. Skipping...
No search results or link found for code 006340. Skipping...
Processing code 011280, new window: [태림포장] 기업지배구조 보고서 공시
Successfully extracted table for code 011280.
No search results or link found for code 014530. Skipping...
Processing code 015860

In [None]:
# to save nested dict using pickle package (serializes nested elements)

with open('kospi_search_governance_data.pkl', 'wb') as f:
    pickle.dump(governance_data, f)

In [None]:
# to read in 

with open('kospi_search_governance_data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)
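
After reloading, a quick summary can show how many codes produced a usable table. This is a hedged sketch: `summarize_results` is a hypothetical helper, it assumes the `{code: {'report_value': ..., 'df': ...}}` shape returned by `scrape_governance_data`, and the example entries are placeholders rather than scraped values.

```python
def summarize_results(results):
    """Count entries with and without a parsed voting table.

    results: dict shaped like {code: {'report_value': ..., 'df': ...}}.
    """
    missing = [code for code, entry in results.items() if entry.get('df') is None]
    return {
        'total': len(results),
        'with_table': len(results) - len(missing),
        'missing_table': missing,
    }


# Placeholder data: object() stands in for a DataFrame.
example = {
    'CODE_A': {'report_value': 'placeholder', 'df': object()},
    'CODE_B': {'report_value': None, 'df': None},
}
summary = summarize_results(example)
print(summary)
```

The Page Navigation Search returns a list rather than a dict, so the same idea would need to iterate over the list entries directly.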