
# Usage Guide for Korean Stopwords Dictionary

This notebook is a quick guide to using the Korean stopwords dictionary, which is provided in JSON and TXT formats.
The dictionary groups stopwords into five categories: `조사` (particles), `접속사` (conjunctions), `관형사` (determiners), `감탄사` (interjections), and `기타` (other),
so you can filter selectively in text-processing tasks.



## Loading the Stopwords from JSON Format

The JSON file organizes the stopwords by category. You can load the entire dictionary and access individual categories as needed.


In [1]:

import json

# Load the stopwords JSON file
with open("stopwords_korean.json", "r", encoding="utf-8") as file:
    stopwords = json.load(file)

# Check the structure of the stopwords dictionary
print("Stopwords structure:", stopwords.keys())

# Example: Print the first 10 '조사' (particle) stopwords
print("조사 stopwords:", stopwords["조사"][:10])


Stopwords structure: dict_keys(['조사', '접속사', '관형사', '감탄사', '기타'])
조사 stopwords: ['가', '각', '과', '나', '너', '너희', '니', '대', '데', '도']
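
When the category distinction is not needed, the per-category lists can be merged into a single set, which makes membership checks faster than scanning each list. The sketch below assumes every category maps to a list of strings, as the output above suggests. `sample_stopwords` and `flatten_stopwords` are hypothetical names used only for illustration, and the small dictionary stands in for the loaded file.

```python
# Hypothetical stand-in for the dictionary loaded from stopwords_korean.json
sample_stopwords = {
    "조사": ["가", "과", "도"],
    "접속사": ["그리고", "그러나"],
    "기타": ["도"],  # a word repeated across categories collapses in the set
}

def flatten_stopwords(stopwords_dict, categories=None):
    """Merge the chosen categories (default: all) into one set of words."""
    if categories is None:
        categories = stopwords_dict.keys()
    return {word for cat in categories for word in stopwords_dict[cat]}

all_words = flatten_stopwords(sample_stopwords)
print("Unique stopwords:", len(all_words))
print("Contains '그리고':", "그리고" in all_words)
```

Building the set once and reusing it avoids repeating the merge for every document you filter.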



## Loading the Stopwords from TXT Format

The TXT file is organized with category headers marked by `#`. Each word is listed line-by-line under its category.
This example shows how to load and structure the TXT file for use in your code.


In [2]:

# Load the stopwords from the TXT file
stopwords_txt = {}

with open("stopwords_korean.txt", "r", encoding="utf-8") as file:
    current_category = None
    for line in file:
        line = line.strip()
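        if line.startswith("#"):
            # Start a new category; keep existing words if a header repeats
            current_category = line[1:].strip()
            stopwords_txt.setdefault(current_category, [])
        elif line and current_category is not None:
            # Append the word to the current category
            # (words that appear before the first header are skipped)
            stopwords_txt[current_category].append(line)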

# Example: Print the first 10 '조사' stopwords
print("조사 stopwords (TXT):", stopwords_txt["조사"][:10])


조사 stopwords (TXT): ['가', '각', '과', '나', '너', '너희', '니', '대', '데', '도']
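
For reference, the TXT layout described above can be generated from a category dictionary and parsed back again. This is a minimal sketch that assumes headers of the form `# category` followed by one word per line. `to_txt`, `from_txt`, and the sample data are hypothetical, and the actual file may differ in spacing or blank lines.

```python
def to_txt(stopwords_dict):
    """Serialize {category: [words]} into the '#'-header layout."""
    lines = []
    for category, words in stopwords_dict.items():
        lines.append(f"# {category}")
        lines.extend(words)
        lines.append("")  # blank line between categories
    return "\n".join(lines)

def from_txt(text):
    """Parse the '#'-header layout back into {category: [words]}."""
    result, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            current = line[1:].strip()
            result.setdefault(current, [])
        elif line and current is not None:
            result[current].append(line)
    return result

# Hypothetical sample data, not taken from the real dictionary
sample = {"조사": ["가", "과"], "감탄사": ["아", "오"]}
round_trip = from_txt(to_txt(sample))
print("Round trip preserved the data:", round_trip == sample)
```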



## Example: Removing Stopwords from Text

The function `remove_stopwords` splits the input text on whitespace and drops every token that exactly matches a stopword in the selected categories.
By default it filters all categories; pass a list of category names to restrict filtering. Because matching is per token, a particle attached to a word (such as `는` in `텍스트는`) stays in place, as the output below shows.


In [3]:

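def remove_stopwords(text, stopwords_dict, categories=None):
    """Remove stopwords from text based on the specified categories.

    Parameters:
        text (str): The input text to filter.
        stopwords_dict (dict): Dictionary mapping category names to stopword lists.
        categories (list, optional): Categories to filter out.
            Defaults to every category in stopwords_dict.

    Returns:
        str: The filtered text without stopwords.
    """
    # Avoid a mutable default argument; fall back to all categories
    if categories is None:
        categories = list(stopwords_dict)
    # Merge the selected categories into one set for fast membership tests
    selected = {word for cat in categories for word in stopwords_dict[cat]}
    # Split on whitespace and keep only tokens that are not stopwords
    filtered_words = [word for word in text.split() if word not in selected]
    return " ".join(filtered_words)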

# Example usage (the sample sentence means "This text was written as an example.")
sample_text = "이 텍스트는 예시로 작성되었습니다."
filtered_text = remove_stopwords(sample_text, stopwords)
print("Original text:", sample_text)
print("Filtered text:", filtered_text)


Original text: 이 텍스트는 예시로 작성되었습니다.
Filtered text: 텍스트는 예시로 작성되었습니다.
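
To filter only some categories, pass their names explicitly. The sketch below is self-contained and uses a hypothetical toy dictionary and helper (`toy_stopwords`, `filter_tokens`) instead of the real file. It assumes the same whitespace tokenization as the function above.

```python
# Hypothetical toy dictionary; the real file contains many more entries
toy_stopwords = {
    "관형사": ["이", "그", "저"],
    "접속사": ["그리고", "하지만"],
}

def filter_tokens(text, stopwords_dict, categories):
    """Drop whitespace-separated tokens found in the chosen categories."""
    selected = {w for cat in categories for w in stopwords_dict[cat]}
    return " ".join(tok for tok in text.split() if tok not in selected)

text = "이 문장은 짧다 그리고 저 문장은 길다"
only_conj = filter_tokens(text, toy_stopwords, ["접속사"])
both = filter_tokens(text, toy_stopwords, ["관형사", "접속사"])
print("Conjunctions removed:", only_conj)
print("Determiners and conjunctions removed:", both)
```

Tokens such as `문장은` survive in both cases because only exact token matches are removed. Stripping attached particles would need morphological analysis, which this guide does not cover.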
