[DP-381] Implement Korean tokenizer + TF-IDF keyword extraction by suheon98 · Pull Request #53 · Devpick-Org/devpick-ai

suheon98 · 2026-04-24T01:04:55Z

Summary

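KoreanTokenizer: kiwipiepy morphological analysis (keeps NNG/NNP/SL), removes stopwords, short tokens, and numeric tokens
TfidfAnalyzer: keyword and score extraction based on scikit-learn TfidfVectorizer (1- to 2-grams, top 30), including a cold start fallback
app/core/data/stopwords_ko.txt: about 350 stopwords tailored to Korean developer blogs (common nouns, filler words, generic English words)
requirements.txt: adds kiwipiepy and scikit-learn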
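
The filtering steps listed for KoreanTokenizer can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the names `filter_morphs`, `kiwi_morphs`, and `tokenize`, and the minimum token length of 2, are hypothetical. Only the kept POS tags (NNG/NNP/SL) and the kinds of tokens removed come from the summary above. The analyzer is passed in as a function so the filter can be exercised without kiwipiepy installed.

```python
from typing import Callable, Iterable, List, Set, Tuple

# POS tags the summary says are kept: general noun, proper noun, foreign-language token.
KEEP_TAGS = {"NNG", "NNP", "SL"}

Morph = Tuple[str, str]  # (surface form, POS tag)


def filter_morphs(morphs: Iterable[Morph], stopwords: Set[str], min_len: int = 2) -> List[str]:
    """Keep only tokens that pass the POS, length, numeric, and stopword checks."""
    kept = []
    for form, tag in morphs:
        if tag not in KEEP_TAGS:
            continue  # drop particles, verbs, numerals, punctuation, ...
        if len(form) < min_len:
            continue  # drop short tokens (the threshold of 2 is an assumption)
        if form.isdigit():
            continue  # drop purely numeric tokens
        if form in stopwords:
            continue  # drop stopwords
        kept.append(form)
    return kept


def kiwi_morphs(text: str) -> List[Morph]:
    """Analyze text with kiwipiepy. Imported lazily; real code would reuse one Kiwi instance."""
    from kiwipiepy import Kiwi

    return [(token.form, token.tag) for token in Kiwi().tokenize(text)]


def tokenize(text: str, stopwords: Set[str],
             analyze: Callable[[str], List[Morph]] = kiwi_morphs) -> str:
    """Return the surviving tokens joined by single spaces ('' for empty input)."""
    if not text or not text.strip():
        return ""
    return " ".join(filter_morphs(analyze(text), stopwords))
```

Returning a space-joined string keeps the output directly usable as a document for a whitespace-based vectorizer, which appears to be what the "space join" test case refers to.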

pytest tests/test_trend_tokenizer.py: 5 tests (POS filter, number removal, stopwords, empty input, space-joined output)
pytest tests/test_trend_tfidf.py: 5 tests (top_n, sort order, empty input, cold start fallback, tuple format)
pytest -q: all 382 tests pass
ruff check . && black --check . pass
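
The behaviors covered by the TF-IDF tests (top_n, descending order, empty input, cold start fallback, tuple output) could look something like the sketch below. It is a standard-library stand-in, not the scikit-learn based TfidfAnalyzer from this PR: `top_keywords`, `ngrams`, and `min_docs` are hypothetical names, scores are summed across documents without L2 normalization, and the cold start rule (fall back to raw term counts when there are fewer than `min_docs` documents) is an assumption about what the fallback does. The smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, matches scikit-learn's default `smooth_idf=True` behavior.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], max_n: int = 2) -> List[str]:
    """Return all 1-grams up to max_n-grams of a token list, as space-joined strings."""
    out = []
    for n in range(1, max_n + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def top_keywords(docs: List[str], top_n: int = 30, min_docs: int = 2) -> List[Tuple[str, float]]:
    """Rank 1- to 2-gram terms over space-joined documents; returns (term, score) tuples."""
    term_lists = [ngrams(doc.split()) for doc in docs if doc.strip()]
    if not term_lists:
        return []  # empty input yields no keywords

    n_docs = len(term_lists)
    scores: Counter = Counter()
    if n_docs < min_docs:
        # Assumed cold start fallback: with too few documents IDF carries
        # no signal, so rank by raw term frequency instead.
        for terms in term_lists:
            scores.update(terms)
    else:
        # Document frequency: number of documents containing each term.
        df = Counter(term for terms in term_lists for term in set(terms))
        for terms in term_lists:
            for term, tf in Counter(terms).items():
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
                scores[term] += tf * idf

    # Highest score first; ties broken alphabetically for deterministic output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, float(score)) for term, score in ranked[:top_n]]
```

For example, `top_keywords(["파이썬 비동기 처리", "파이썬 타입 힌트"], top_n=5)` returns at most five `(term, score)` tuples in descending score order, while a single-document call takes the frequency-based fallback path.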

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(trend): implement Korean tokenizer + TF-IDF keyword extraction (DP-381)

a0e5843

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

suheon98 merged commit 827297d into developV2 Apr 24, 2026
1 check passed