# Text Data Preprocessing Practice Notebook
Learning text cleaning hands-on through 30 techniques

In [1]:
!pip install -U -q datasets huggingface_hub fsspec


## 📦 1. Loading the Data (IMDB)

In [3]:
from datasets import load_dataset
dataset = load_dataset("stanfordnlp/imdb") # https://huggingface.co/datasets/stanfordnlp/imdb
text = dataset["train"][0]["text"]




In [4]:
import json
dataset["train"][3]

{'text': "This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.",
 'label': 0}

In [6]:
print(text)

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

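## 🔹 Text Preprocessing Overview
- Text preprocessing is the key preparatory stage in natural language processing (NLP): it raises the quality of the data a model is trained on.
- Cleaned text carries less noise, which tends to improve a model's generalization.
- Applications: sentiment analysis, chatbots, machine translation, summarization, and other NLP tasks

📌 Example: 'I LoVe this movie!!! 😍😍😍' → 'i love this movie'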

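## 🔹 Preprocessing Pipeline
- Typical order: text collection → cleaning → normalization → tokenization → organizing and saving
- The pipeline's composition can vary by domain (medical, social media, news, etc.)

📌 Example: IMDB review → lowercasing → punctuation removal → stopword removal → stemming → tokenization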

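## 🔹 Lowercasing
- Removes the upper/lowercase distinction so that word forms are unified
- A standard preprocessing step for English and other languages that have letter case
- Exception: often skipped for tasks where identifying proper nouns matters

📌 Example: 'Apple is great.' → 'apple is great.'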

In [7]:
text.lower()

'i rented i am curious-yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u.s. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" i really had to see this for myself.<br /><br />the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life. in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />what kills me about i am curious-yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, ev

## 🔹 Punctuation Removal
- Punctuation usually carries little meaning on its own, so it is a common removal target
- However, sentiment analysis may keep exclamation marks (!)

📌 Example: 'I love it!!!' → 'I love it'

In [10]:
import re
re.sub(r'[^\w\s]', '', text)

'I rented I AM CURIOUSYELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967 I also heard that at first it was seized by US customs if it ever tried to enter this country therefore being a fan of films considered controversial I really had to see this for myselfbr br The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States In between asking politicians and ordinary denizens of Stockholm about their opinions on politics she has sex with her drama teacher classmates and married menbr br What kills me about I AM CURIOUSYELLOW is that 40 years ago this was considered pornographic Really the sex and nudity scenes are few and far between even then its not shot like some 
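
The output above contains `myselfbr br The` because the pattern strips the angle brackets of `<br />` but leaves the letters `br`. A minimal sketch of one way around this, assuming the line-break tags should become spaces before punctuation is removed; `strip_punctuation` and `sample` are hypothetical names, not part of the notebook:

```python
import re

def strip_punctuation(raw: str) -> str:
    # Replace <br>, <br/> and <br /> with a space so adjacent words stay separated.
    no_breaks = re.sub(r"<br\s*/?>", " ", raw)
    # Drop every character that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", "", no_breaks)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", no_punct).strip()

sample = "I really had to see this for myself.<br /><br />The plot is centered around Lena."
print(strip_punctuation(sample))
```

Handling the tags first keeps `myself` and `The` as separate tokens for the later steps.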

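## 🔹 Stopword Removal
- Stopwords are words such as 'the', 'is', and 'and' that occur frequently but carry little information
- Supported by most NLP libraries (nltk, spaCy)
- Useful for topic modeling, document classification, and similar tasks

📌 Example: 'this is a good movie' → 'good movie'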

In [None]:
!pip install -q nltk
import nltk
nltk.download('stopwords')

In [14]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [18]:
tokens = [w for w in text.split() if w.lower() not in stop_words]
tokens[:20]

['rented',
 'CURIOUS-YELLOW',
 'video',
 'store',
 'controversy',
 'surrounded',
 'first',
 'released',
 '1967.',
 'also',
 'heard',
 'first',
 'seized',
 'U.S.',
 'customs',
 'ever',
 'tried',
 'enter',
 'country,',
 'therefore']

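## 🔹 Removing Imbalanced Words (Frequency-Based)
- Removes words that appear too often or too rarely across the whole corpus
- Applicable tasks: BoW and TF-IDF based classification, topic modeling, etc.
- Uses statistical thresholds, such as dropping words in the bottom 5% or the top 1% by frequency

📌 Example: remove the high-frequency 'movie' and 'film' and the low-frequency 'xyzabc123'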

In [21]:
from collections import Counter
words = text.split()
print(words[:20])


['I', 'rented', 'I', 'AM', 'CURIOUS-YELLOW', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was']


In [9]:
from collections import Counter  # imported here so the cell does not depend on execution order
freq = Counter(words)
filtered = [w for w in words if freq[w] > 5]
filtered[:20]
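
The bullets above describe filtering across a whole corpus, while the cell counts words inside a single review. A minimal sketch of the corpus-level variant, assuming document frequency (the number of documents a word appears in) as the criterion; the toy `corpus`, the function name, and the `min_df` / `max_df_ratio` thresholds are all hypothetical:

```python
from collections import Counter

# Hypothetical toy corpus; in the notebook this would be many IMDB reviews.
corpus = [
    "the movie was great fun",
    "the movie was dull",
    "great plot but the movie dragged",
    "xyzabc123 the movie plot",
]

def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    tokenized = [doc.split() for doc in docs]
    # Count in how many documents each word appears (not its total occurrences).
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    max_df = max_df_ratio * len(docs)
    # Keep words that are neither too rare nor too common.
    keep = {word for word, df in doc_freq.items() if min_df <= df <= max_df}
    return [[word for word in tokens if word in keep] for tokens in tokenized]

filtered_corpus = filter_by_document_frequency(corpus)
print(filtered_corpus)
```

Here 'the' and 'movie' occur in every document and are dropped as too frequent, while 'xyzabc123' occurs once and is dropped as too rare.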

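## 🔹 Stemming
- Strips suffixes to reduce a word to its stem (e.g. playing → play)
- PorterStemmer and LancasterStemmer are representative implementations
- The goal is to unify surface forms rather than to preserve meaning

📌 Example: 'running', 'runs' → 'run' (suffix-stripping stemmers such as Porter leave irregular forms like 'ran' unchanged)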

In [27]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(words[:20])

stemmed = [stemmer.stem(word) for word in text.split()]
stemmed[:20]

['I', 'rented', 'I', 'AM', 'CURIOUS-YELLOW', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was']


['i',
 'rent',
 'i',
 'am',
 'curious-yellow',
 'from',
 'my',
 'video',
 'store',
 'becaus',
 'of',
 'all',
 'the',
 'controversi',
 'that',
 'surround',
 'it',
 'when',
 'it',
 'wa']

In [33]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer(language='english')
print(words[:20])

stemmed = [stemmer.stem(word) for word in text.split()]
stemmed[:20]

['I', 'rented', 'I', 'AM', 'CURIOUS-YELLOW', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was']


['i',
 'rent',
 'i',
 'am',
 'curious-yellow',
 'from',
 'my',
 'video',
 'store',
 'becaus',
 'of',
 'all',
 'the',
 'controversi',
 'that',
 'surround',
 'it',
 'when',
 'it',
 'was']

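## 🔹 Lemmatization
- Reduces a word to its dictionary form based on context and part of speech
- Supported by spaCy and nltk; requires part-of-speech information
- More accurate than stemming, but computationally more expensive

📌 Example: 'better' → 'good', 'was' → 'be'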

In [31]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
lemmas = [token.lemma_ for token in doc]
texts = [token.text for token in doc]
print(texts[:20])
lemmas[:20]

['I', 'rented', 'I', 'AM', 'CURIOUS', '-', 'YELLOW', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when']


['I',
 'rent',
 'I',
 'be',
 'CURIOUS',
 '-',
 'yellow',
 'from',
 'my',
 'video',
 'store',
 'because',
 'of',
 'all',
 'the',
 'controversy',
 'that',
 'surround',
 'it',
 'when']

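## 🔹 Normalization - Dates and Numbers
- Standardize date formats: Jan 20th, 2023 → 2023-01-20
- Unify number notation: 5k → 5000, 1M → 1000000
- Also covers converting currency amounts, units, and decimals

📌 Example: '5k followers' → '5000 followers'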

In [25]:
re.sub(r'\b(\d+)k\b', lambda m: str(int(m.group(1)) * 1000), 'She earned 5k.')

'She earned 5000.'
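
The cell above only handles the `k` suffix. A minimal sketch that also covers `M` and decimal values, assuming `k` means thousand and `M` means million; the names `SUFFIX_MULTIPLIERS` and `expand_number_suffixes` are hypothetical, and a lowercase `m` may well mean meters in real data, so the mapping is an assumption to adjust per corpus:

```python
import re

# Assumed meaning of each suffix; extend or restrict this per dataset.
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

def expand_number_suffixes(raw: str) -> str:
    def _expand(match):
        value = float(match.group(1))
        multiplier = SUFFIX_MULTIPLIERS[match.group(2).lower()]
        # round() avoids artifacts from floating-point multiplication.
        return str(round(value * multiplier))
    # Digits (optionally with a decimal part) directly followed by k or M.
    return re.sub(r"\b(\d+(?:\.\d+)?)([kKmM])\b", _expand, raw)

print(expand_number_suffixes("5k followers and 1.5M views"))
```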

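## 🔹 Emoji Removal
- Removes non-linguistic symbols that can get in the way of model training
- Usually removed from formal data such as news and legal text
- Uses the emoji package

📌 Example: 'I love this! 😍😍' → 'I love this! '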

In [36]:
!pip install -q emoji

In [42]:
import emoji
emoji_text="""I rented I AM CURIOUS-YELLOW 🎬 from my video store because of all the controversy ⚠️ that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs 🛃 if it ever tried to enter this country, therefore being a fan of films considered 'controversial' 😏 I really had to see this for myself.

The plot is centered around a young Swedish drama student named Lena 🎭 who wants to learn everything she can about life 🌍. In particular, she wants to focus her attentions on making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War 🪖 and race issues in the United States 🇺🇸. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men 🔥.

What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic 😲. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno 🎞️. While my countrymen might find it shocking 😳, in reality sex and nudity are a major staple in Swedish cinema 🇸🇪. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films 🎥.

I do commend 👏 the filmmakers for the fact that any sex shown in the film is shown for artistic purposes 🎨 rather than just to shock people and make money 💸 to be shown in pornographic theaters in America. *I AM CURIOUS-YELLOW* is a good film for anyone wanting to study the meat and potatoes 🍖🥔 (no pun intended) of Swedish cinema. But really, this film doesn’t have much of a plot 📄."""
cleaned = emoji.replace_emoji(emoji_text, replace='')
cleaned

"I rented I AM CURIOUS-YELLOW  from my video store because of all the controversy  that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs  if it ever tried to enter this country, therefore being a fan of films considered 'controversial'  I really had to see this for myself.\n\nThe plot is centered around a young Swedish drama student named Lena  who wants to learn everything she can about life . In particular, she wants to focus her attentions on making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War  and race issues in the United States . In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men .\n\nWhat kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic . Really, the sex and nudity scenes are few and far between, even th

## 🔹 Emoji → Word Replacement
- For sentiment analysis and similar tasks, converting emoji to words preserves their information
- emoji.demojize turns emoji into text
- e.g. 😂 → :face_with_tears_of_joy:

📌 Example: 'I ❤️ NLP!' → 'I :red_heart: NLP!'

In [43]:
emoji.demojize("I ❤️ NLP!")

'I :red_heart: NLP!'

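## 🔹 URL Removal
- Text collected from the web often contains advertising or source links
- URLs are mostly treated as meaningless noise
- They can be removed with a regular expression

📌 Example: 'Visit http://example.com now' → 'Visit now'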
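
A minimal sketch of the regex approach, assuming that sentence punctuation glued to the end of a URL should survive and that leftover double spaces should be collapsed; `URL_PATTERN` and `remove_urls` are hypothetical names:

```python
import re

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(raw: str) -> str:
    def _drop(match):
        url = match.group(0)
        # \S+ also swallows a trailing '.', ',' or '!'; hand that part back.
        return url[len(url.rstrip(".,!?;:")):]
    without_urls = URL_PATTERN.sub(_drop, raw)
    # Collapse runs of spaces, then remove the space left before punctuation.
    without_urls = re.sub(r"[ \t]+", " ", without_urls)
    return re.sub(r" ([.,!?;:])", r"\1", without_urls)

print(remove_urls("Visit http://example.com now"))
print(remove_urls("Read https://blog.example.com/articles?id=123. See www.example.org for more!"))
```

A plain `http\S+` substitution is shorter, but it removes the period that ends a sentence and leaves two spaces where the URL used to be.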

In [45]:
import re
text = """
Check out our website at http://example.com for more details.
You can also visit our blog at https://blog.example.com/articles?id=123.
Don't forget to look at www.example.org for additional resources!
"""
cleaned = re.sub(r"http\S+|www\.\S+", "", text)
cleaned

"\nCheck out our website at  for more details.\nYou can also visit our blog at \nDon't forget to look at  for additional resources!\n"

## 🔹 HTML Tag Removal
- Crawled news articles and blog posts may contain HTML
- Markup that is not text (<div>, <a>) is a removal target
- BeautifulSoup can be used

📌 Example: '<p>Hello</p>' → 'Hello'

In [46]:
from bs4 import BeautifulSoup
text = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is a <b>sample</b> paragraph with <a href="https://example.com">a link</a>.</p>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
  </body>
</html>
"""
cleaned = BeautifulSoup(text, "html.parser").get_text()
cleaned

'\n\nExample Page\n\nWelcome to My Website\nThis is a sample paragraph with a link.\n\nItem 1\nItem 2\n\n\n\n'

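## 🔹 Chat-Speak Normalization
- Cleans up the abbreviations and non-standard words typical of social media and messengers
- e.g. 'u r' → 'you are', 'idk' → 'I don't know'
- Handled with a dictionary or with regular expressions

📌 Example: 'u r awesome idk why' → 'you are awesome I don't know why'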
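
A minimal sketch that combines the dictionary and regex approaches, assuming matches should be case-insensitive and should work next to punctuation; `CHAT_DICT` is a hypothetical, deliberately tiny dictionary:

```python
import re

# Hypothetical slang dictionary; a real one would be much larger.
CHAT_DICT = {"u": "you", "r": "are", "idk": "I don't know", "brb": "be right back"}

# One alternation with word boundaries, longest keys first.
_CHAT_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, CHAT_DICT), key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def normalize_chat(raw: str) -> str:
    # Look up the lowercased match so 'IDK' and 'idk' map to the same entry.
    return _CHAT_PATTERN.sub(lambda m: CHAT_DICT[m.group(1).lower()], raw)

print(normalize_chat("u r awesome idk why"))
print(normalize_chat("IDK what u r doing, brb!"))
```

Single-letter keys such as `u` can misfire on abbreviations like "U.S.", so a production dictionary would need exceptions or context checks.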

In [47]:
chat_dict = {'u': 'you', 'r': 'are', 'idk': "I don't know"}
text = 'idk what u r doing'
' '.join([chat_dict.get(w, w) for w in text.split()])

"I don't know what you are doing"

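## 🔹 Spelling Correction
- Automatically fixes typing mistakes and typos
- Prevents meaning from being distorted in sentiment analysis and similar tasks
- Uses PySpellChecker, GingerIt, etc.

📌 Example: 'recieve' → 'receive'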

In [55]:
!pip install pyspellchecker

In [58]:
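from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spellings(text):
    """Replace each word the checker does not know with its best suggestion."""
    words = text.split()
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        if word in misspelled_words:
            # correction() can return None when it finds no candidate; keep the word then
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)


# Pass the whole string once; the function splits it into words itself.
# Note that slang such as 'idk' is "corrected" to 'ink', so chat-speak should be normalized first.
corrected = correct_spellings(text)
corrected

'ink what u r doing'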

## 🔹 Contraction Expansion
- don’t → do not, I’ll → I will
- Expanding contractions helps preserve sentiment and grammatical cues
- Uses the contractions library

📌 Example: 'I can't' → 'I cannot'

In [59]:
!pip install -q contractions

In [63]:
import contractions
expanded = contractions.fix("I can't do this.")
expanded

'I cannot do this.'

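## 🔹 Numbers → Text / Removal
- For TTS, summarization, translation, etc., numbers are sometimes converted to words
- e.g. '100%' → 'one hundred percent'
- Alternatively, the numbers themselves may be removed

📌 Example: '100%' → 'one hundred percent'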

In [61]:
!pip install -q inflect

In [62]:
import inflect
p = inflect.engine()
p.number_to_words(100)

'one hundred'

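## 🔹 Removing Conversational Fillers and Verbal Tics
- Removes the repeated characters, interjections, etc. that are common in online comments and social media text
- e.g. 'ㅋㅋㅋ' (Korean laughter, similar to "lol"), '헐' (Korean for "whoa"), 'OMG', 'Wow'
- These can interfere with language understanding or act as noise

📌 Example: '헐ㅋㅋㅋ 진짜요??ㅋㅋ' → '헐 진짜요??'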

In [64]:
import re
# Apply the pattern to a Korean sample; the earlier `text` contains none of these characters
re.sub(r'(ㅋ|ㅎ|ㅠ|ㅜ){2,}', '', '헐ㅋㅋㅋ 진짜요??ㅋㅋ')

'헐 진짜요??'

## 🔹 Regular Expression Pattern Summary
- Date: '\d{4}-\d{2}-\d{2}' or 'Jan \d{1,2}, \d{4}'
- Email: '\b[\w.-]+@[\w.-]+\.\w+\b'
- URL: 'http[s]?://\S+'
- Number: '\b\d+\b' / Currency: '\$\d+(\.\d+)?'

📌 Example: build a dictionary of regexes → automate extracting and removing various patterns from text
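
The idea of a regex dictionary can be sketched as follows, using patterns from the list above. The names `PATTERNS` and `extract_patterns` and the sample sentence are hypothetical; the currency pattern uses a non-capturing group so that `findall` returns the whole match:

```python
import re

# Patterns from the list above, compiled once for reuse.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"\b[\w.-]+@[\w.-]+\.\w+\b"),
    "url": re.compile(r"https?://\S+"),
    "currency": re.compile(r"\$\d+(?:\.\d+)?"),
}

def extract_patterns(raw: str) -> dict:
    # No pattern has a capturing group, so findall yields full matches.
    return {name: pattern.findall(raw) for name, pattern in PATTERNS.items()}

sample = "Paid $19.99 on 2023-10-15, receipt sent to jane@example.com via https://example.com/r/1"
found = extract_patterns(sample)
print(found)
```

The same dictionary can drive removal instead of extraction by calling `pattern.sub("", raw)` for each entry.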

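## 🔹 Integrated Cleaning with spaCy
- Performs stopword removal, lemmatization, and alphabetic filtering in one pass with spaCy
- Forms a representative preprocessing workflow for English
- Models: en_core_web_sm, etc.

📌 Example: 'He was running quickly.' → 'run quickly'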

In [69]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
cleaned = " ".join([token.lemma_ for token in doc if not token.is_stop and token.is_alpha])
cleaned[:20]

'idk u r'

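## 🔹 Domain-Specific Preprocessing: Sentiment Analysis
- Negation words, emoticons, exclamation marks, etc. need to be preserved for positive/negative classification
- Repeated characters used for emphasis ('좋아아아아', a stretched-out Korean '좋아', "I like it") should be shortened appropriately
- Emoji can be kept by converting them to words

📌 Example: '진짜 좋아아아아!!!😍' → '진짜 좋아!!! :smiling_face_with_heart_eyes:'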
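
The character-shortening step can be sketched with a back-reference. This is a minimal sketch, assuming that only letters are squeezed so that '!!!' survives; `squeeze_repeats` and its `keep` parameter are hypothetical, and the emoji conversion from the example is left to `emoji.demojize`:

```python
import re

def squeeze_repeats(raw: str, keep: int = 1) -> str:
    # [^\W\d_] matches any Unicode letter; \1{2,} requires at least
    # three identical letters in a row before anything is shortened.
    return re.sub(r"([^\W\d_])\1{2,}", lambda m: m.group(1) * keep, raw)

print(squeeze_repeats("진짜 좋아아아아!!!"))
print(squeeze_repeats("sooooo goooood!!!", keep=2))
```

For English, keeping two letters is the safer default because words such as "good" legitimately contain a doubled letter.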

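## 🔹 Domain-Specific Preprocessing: News Text
- Remove reporter names, dates, and sources / keep paragraph breaks
- Preserve proper nouns and numeric expressions
- Apply lemmatization selectively

📌 Example: ‘[서울=뉴스1] 홍길동 기자 = ...’ → ‘...’ (a Korean wire byline: "[Seoul=News1] reporter Hong Gil-dong = ...")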
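
A minimal sketch of byline removal for the Korean wire format in the example, assuming the byline always has the shape "[place=agency] name 기자 = " at the start of the article (기자 means "reporter"); `BYLINE_PATTERN` and `strip_byline` are hypothetical names and the sample sentence is made up:

```python
import re

# Leading "[place=agency] name 기자 = " byline; other outlets will need other patterns.
BYLINE_PATTERN = re.compile(r"^\s*\[[^\]]+\]\s*\S+\s*기자\s*=\s*")

def strip_byline(raw: str) -> str:
    # count=1 makes sure only the leading byline is removed.
    return BYLINE_PATTERN.sub("", raw, count=1)

article = "[서울=뉴스1] 홍길동 기자 = 정부는 오늘 새 정책을 발표했다."
print(strip_byline(article))
```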

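## 🔹 Domain-Specific Preprocessing: Twitter Data
- Handle mentions (@user), hashtags (#topic), and abbreviations (idk, brb)
- Remove RT markers, replace emoji, remove URLs
- Targets tweets collected with tools such as Twint

📌 Example: 'RT @user: I love this! #NLP http://t.co/abc' → 'I love this!'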
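
A minimal sketch that reproduces the example above step by step, assuming one tweet per string and that hashtags are dropped entirely rather than kept as words; `clean_tweet` is a hypothetical name:

```python
import re

def clean_tweet(raw: str) -> str:
    text = re.sub(r"^RT\s+@\w+:\s*", "", raw)   # leading retweet marker with its mention
    text = re.sub(r"https?://\S+", "", text)     # URLs
    text = re.sub(r"[@#]\w+", "", text)          # remaining mentions and hashtags
    return re.sub(r"\s+", " ", text).strip()     # leftover whitespace

print(clean_tweet("RT @user: I love this! #NLP http://t.co/abc"))
```

Removing the whole "RT @user:" prefix in one step avoids the stray " : " that appears when "RT" and the mention are deleted separately.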

In [70]:
text = """
RT @user1: Check out the new features in our product! #AI #innovation http://example.com
Great job team! @dev_team Let's keep pushing forward. #motivation
Visit our homepage: https://company.org/about
"""
re.sub(r'@\w+|#\w+|\bRT\b|http\S+', '', text)  # \b keeps "RT" inside words such as "SMART" intact

"\n : Check out the new features in our product!   \nGreat job team!  Let's keep pushing forward. \nVisit our homepage: \n"

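## 🔹 Domain-Specific Preprocessing: Legal Documents
- Domain-specific terms such as statute names and article numbers must be preserved
- Clean up whitespace and line breaks, and remove tables of contents and case numbers
- Mainly uses the CaseLaw and LEGISLAText datasets

📌 Example: '제1조(목적) 이 법은...' → '이 법은...' (drops the Korean heading "Article 1 (Purpose)" and keeps "This Act ...")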

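## 🔹 Domain-Specific Preprocessing: Medical Data
- Remove PHI (protected health information) such as patient names, hospital names, and dates
- Do not replace medical terms with general-dictionary words (preserve them)
- Uses the MIMIC-III and PubMed datasets

📌 Example: 'Patient: John Doe' → ''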

In [72]:
import re
text = """
Name: John Doe
Patient: Jane Smith
Hospital: Seoul Medical Center
Date: 2023-10-15

Chief Complaint: Persistent cough and shortness of breath.
Diagnosis: Acute bronchitis
Medication: Amoxicillin 500mg, three times a day
Notes: Patient reports symptoms worsening over 5 days.
"""
text = re.sub(r'(Name:|Patient:|Hospital:|Date:).*', '', text)
print(text)







Chief Complaint: Persistent cough and shortness of breath.
Diagnosis: Acute bronchitis
Medication: Amoxicillin 500mg, three times a day
Notes: Patient reports symptoms worsening over 5 days.
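
The output above keeps an empty line for every removed field. A minimal sketch of a line-anchored variant that deletes the whole line, assuming the same four field labels; `PHI_FIELDS` and `PHI_LINE` are hypothetical names, and pattern matching like this is only an illustration, not a complete de-identification method:

```python
import re

PHI_FIELDS = ("Name", "Patient", "Hospital", "Date")
fields = "|".join(PHI_FIELDS)
# ^ with re.MULTILINE anchors at each line start, so "Notes: Patient reports ..."
# is never touched; the optional \n removes the line break as well.
PHI_LINE = re.compile(rf"^(?:{fields}):.*\n?", flags=re.MULTILINE)

record = (
    "Name: John Doe\n"
    "Date: 2023-10-15\n"
    "Diagnosis: Acute bronchitis\n"
    "Notes: Patient reports symptoms worsening over 5 days.\n"
)
deidentified = PHI_LINE.sub("", record)
print(deidentified)
```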



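## 🔹 Building a Custom Preprocessing Pipeline
- Connect several preprocessing functions into a single pipeline
- Write each stage as a function so it can be reused
- Use function chaining or a lambda-based pipe

📌 Example: 'I LOVE it!!! http://abc.com' → 'i love it'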

In [68]:
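import re

def clean_pipeline(text):
    text = text.lower()                       # unify case
    text = re.sub(r"http\S+", "", text)       # drop URLs before punctuation is stripped
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # tidy leftover whitespace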
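
The function-chaining idea from the bullets can be sketched with `functools.reduce`. The step functions and `make_pipeline` are hypothetical names chosen for illustration; each step takes a string and returns a string, so steps can be reordered or swapped freely:

```python
import re
from functools import reduce

def lowercase(text):
    return text.lower()

def remove_urls(text):
    return re.sub(r"http\S+", "", text)

def remove_punctuation(text):
    return re.sub(r"[^\w\s]", "", text)

def squeeze_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

def make_pipeline(*steps):
    # Feed the output of each step into the next one, left to right.
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

pipeline = make_pipeline(lowercase, remove_urls, remove_punctuation, squeeze_whitespace)
print(pipeline("I LOVE it!!! http://abc.com"))
```

Order matters: URLs are removed before punctuation, because stripping punctuation first would turn `http://abc.com` into the pseudo-word `httpabccom`.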

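## 🔹 Running the Full Preprocessing Example (IMDB)
- Apply preprocessing step by step to text from the IMDB dataset
- Produce a before vs. after comparison table
- Include the change in token count and a word cloud visualization

📌 Example: before preprocessing: 350 tokens → after preprocessing: 212 tokens (stopword removal and lemmatization applied)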
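
A minimal, dependency-free sketch of the before/after token comparison. It assumes whitespace tokenization and uses a tiny hypothetical stopword set instead of the nltk list; the sample is a shortened, made-up variant of the first review, so the counts it prints are not the figures quoted above:

```python
import re

# Tiny hypothetical stopword list, used only to keep the sketch self-contained.
STOPWORDS = {"i", "the", "a", "of", "it", "was", "this", "to", "and", "in", "my", "from"}

def preprocess(raw):
    text = re.sub(r"<br\s*/?>", " ", raw.lower())   # lowercase, turn line breaks into spaces
    text = re.sub(r"[^\w\s]", "", text)              # remove punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]

raw = "I rented I AM CURIOUS-YELLOW from my video store.<br /><br />The plot is thin."
before = raw.split()
after = preprocess(raw)
print(f"before: {len(before)} tokens, after: {len(after)} tokens")
print(after)
```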

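## 🔹 Evaluating Preprocessing
- Compare model performance (before vs. after preprocessing)
- Metrics: Accuracy, F1, BLEU, etc.
- The contribution of each preprocessing step can be analyzed

📌 Example: text classification accuracy: 82% before preprocessing → 88% after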
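
A minimal sketch of how such a comparison could be computed, assuming binary labels. The labels and both prediction lists are made-up toy values for illustration only, not results from this notebook; in practice the predictions would come from the same model trained once on raw text and once on preprocessed text:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Made-up toy labels and predictions, purely illustrative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_raw = [1, 1, 0, 1, 0, 1, 1, 0]     # hypothetical model trained on raw text
pred_clean = [1, 0, 1, 1, 0, 1, 1, 0]   # hypothetical model trained on cleaned text

for name, pred in [("raw", pred_raw), ("clean", pred_clean)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.3f}, f1={f1(y_true, pred):.3f}")
```

Repeating the comparison with one preprocessing step disabled at a time gives a rough estimate of each step's contribution.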

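## 🔹 Conclusion and Tips
- Preprocessing needs to be tailored to the domain
- Over-cleaning loses meaning, so balance matters
- Designing an automated preprocessing pipeline is recommended
- A summary of useful open-source tools

📌 Example: Hugging Face Datasets, spaCy, NLTK, emoji, contractions, BeautifulSoup, etc.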