[DP-381] Implement Korean tokenizer + TF-IDF keyword extraction #53

Merged
suheon98 merged 1 commit into developV2 from feature/DP-381-trend-tokenizer-tfidf
Apr 24, 2026

@suheon98
Collaborator

Summary

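  • KoreanTokenizer: morphological analysis with kiwipiepy (keeps NNG/NNP/SL tags); removes stopwords, short tokens, and numeric tokens
  • TfidfAnalyzer: keyword and score extraction based on scikit-learn TfidfVectorizer (1- to 2-grams, top 30), including a cold start fallback
  • app/core/data/stopwords_ko.txt: ~350 stopwords tailored to Korean developer blogs (common nouns, fillers, common English words)
  • requirements.txt: adds kiwipiepy and scikit-learn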
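
For reviewers who want a feel for the filtering stage, here is a minimal sketch of it. It is not the code in this PR. It assumes the morphological analyzer has already produced (form, tag) pairs, so kiwipiepy itself is not called. The function name `filter_tokens`, the `min_len=2` threshold, and the digit check are hypothetical choices made for illustration.

```python
# Hypothetical sketch of the token filter; not the PR's KoreanTokenizer.
# Input is a list of (form, tag) pairs, standing in for analyzer output.

KEEP_TAGS = {"NNG", "NNP", "SL"}  # common noun, proper noun, foreign-script word


def filter_tokens(tagged, stopwords, min_len=2):
    """Keep tokens with an allowed POS tag, then drop short, numeric, and stopword tokens.

    min_len=2 is an assumed threshold; the PR does not state the actual value.
    Returns the surviving tokens joined by single spaces.
    """
    kept = []
    for form, tag in tagged:
        if tag not in KEEP_TAGS:
            continue  # POS filter
        if len(form) < min_len:
            continue  # short token
        if form.isdigit():
            continue  # numeric token
        if form in stopwords:
            continue  # stopword
        kept.append(form)
    return " ".join(kept)


# Hand-written sample pairs for illustration, not real analyzer output.
sample = [
    ("파이썬", "NNP"), ("을", "JKO"), ("사용", "NNG"), ("하", "XSV"),
    ("API", "SL"), ("것", "NNG"), ("2024", "NNG"), ("서버", "NNG"),
]
print(filter_tokens(sample, stopwords={"사용"}))
```

Returning a space-joined string keeps the output directly usable as a document for a whitespace-based vectorizer.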

Test plan

  • pytest tests/test_trend_tokenizer.py: 5 tests (POS filter, number removal, stopwords, empty input, space join)
  • pytest tests/test_trend_tfidf.py: 5 tests (top_n, sort order, empty input, cold start fallback, tuple format)
  • pytest -q: confirmed all 382 tests pass
  • ruff check . && black --check . pass
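
The behaviours the TF-IDF tests name (top_n, descending order, empty input, cold start fallback, tuple format) can be sketched without any dependencies. This is a stand-in, not the PR's TfidfAnalyzer and not a call into scikit-learn. It uses the smoothed IDF and L2 normalisation that scikit-learn documents as TfidfVectorizer defaults. Three things are assumptions: summing weights across documents, treating "cold start" as fewer than `min_docs` documents, and falling back to relative term frequency in that case.

```python
# Hypothetical, dependency-free sketch; not the PR's TfidfAnalyzer.
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams of a token list, each joined by a space."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def top_keywords(docs, top_n=30, min_docs=2):
    """docs: space-joined token strings. Returns [(term, score), ...], highest score first."""
    term_lists = [ngrams(d.split()) for d in docs if d.strip()]
    if not term_lists:
        return []  # empty input

    if len(term_lists) < min_docs:
        # Assumed cold start fallback: IDF is uninformative with so few
        # documents, so rank by relative term frequency instead.
        counts = Counter(t for terms in term_lists for t in terms)
        total = sum(counts.values())
        scores = {t: c / total for t, c in counts.items()}
    else:
        n = len(term_lists)
        df = Counter(t for terms in term_lists for t in set(terms))
        scores = Counter()
        for terms in term_lists:
            tf = Counter(terms)
            # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
            weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
            norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm per document
            for t, w in weights.items():
                scores[t] += w / norm  # assumed aggregation: sum over documents

    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


docs = ["파이썬 비동기 처리", "파이썬 타입 힌트", "러스트 비동기 런타임"]
for term, score in top_keywords(docs, top_n=5):
    print(f"{term}\t{score:.3f}")
```

The alphabetical tie-break is only there to make the ordering reproducible; the PR does not say how ties are handled.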

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@suheon98 suheon98 merged commit 827297d into developV2 Apr 24, 2026
1 check passed