feat(diagnosis): add LLM-powered dataset analysis with redesigned UI #4

Merged
FluxloopAdmin merged 15 commits into main from feat/dataset-llm-scan on Jan 27, 2026

Conversation

@tendtoyj (Collaborator) commented Jan 27, 2026

Summary

Adds LLM-based dataset analysis to the file diagnosis feature and completely redesigns the scan results UI.

Changes

Backend

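  • Add a new LLM analysis service (llm_analysis_service.py, LLMService)
  • Integrate LLM analysis into the file diagnosis service (two-phase processing: file scan → LLM analysis)
  • Add a prompt template for dataset analysis
  • Add a preview of the duplicate row count when merging
  • Add the chardet package (for file encoding detection)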

Frontend

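  • Add a new dataset card component (DatasetCard.tsx)
  • Implement the analyzing screen, with per-file progress display and a stacking animation (DatasetAnalyzingView.tsx)
  • Simplify the diagnosis results UI, reorganized around dataset statistics
  • Restore the agent recommendation checkbox interaction
  • Change the file list layout to vertical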

Test plan

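  • Verify the analysis animation displays correctly while a scan runs after file upload
  • Verify LLM analysis results display correctly on the dataset cards
  • Verify the duplicate row count preview works when merging multiple files
  • Verify the agent recommendation checkboxes can be checked and unchecked
  • Verify files in various encodings are processed correctly on upload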

🤖 Generated with Claude Code

tendtoyj and others added 13 commits January 25, 2026 23:41
…e diagnosis

- Add new Pydantic response models: EncodingInfoResponse, ParsingIntegrityResponse,
  NumericStatsResponse, CategoricalStatsResponse, DateStatsResponse, ColumnStatisticsResponse
- Extend FileDiagnosisResponse with encoding, parsing_integrity, column_statistics, sample_rows
- Update diagnose_files endpoint handler to map all new fields from dataclass to response
- Add _log_diagnosis_result() method for detailed diagnosis logging
- Add chardet dependency for encoding detection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add DatasetAnalyzingView component with 9-step analysis flow
- Extend FileDiagnosis types with encoding, parsing_integrity, column_statistics
- Integrate analyzing step between preview and diagnose in AddDatasetModal
- Display per-file information for each analysis step (files, parsing, columns, quality, statistics)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add LLMAnalysisResult, PotentialItem, IssueItem dataclasses
- Create llm_analysis_service.py with batch processing support
- Add dataset_analysis_prompt.md for LLM prompts
- Update FileDiagnosis to include llm_analysis field
- Add diagnose_files_with_llm() async method
- Update API endpoint to async and include LLM response models
- Add OPENAI_API_KEY fallback in providers.py
- Add unit tests for LLM analysis and caching

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add include_llm parameter to diagnose endpoint for optional LLM analysis
- Implement two-phase UI: fast diagnosis first, then LLM analysis
- Update DatasetAnalyzingView to handle phased step progression
- Add LLM analysis types to frontend API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
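
A minimal sketch of the two-phase flow described in the commit above, assuming the client calls the same endpoint twice: once without `include_llm` for the fast result, then again with it. `scan_file` and `analyze_with_llm` are hypothetical placeholders, and the `FileDiagnosis` shown here is a simplified stand-in, not the project's actual dataclass.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileDiagnosis:
    """Simplified stand-in for the project's FileDiagnosis."""
    filename: str
    row_count: int
    llm_analysis: Optional[str] = None  # stays None unless include_llm=True


def scan_file(filename: str) -> FileDiagnosis:
    """Hypothetical placeholder for the fast, non-LLM file scan."""
    return FileDiagnosis(filename=filename, row_count=0)


async def analyze_with_llm(diagnosis: FileDiagnosis) -> str:
    """Hypothetical placeholder for the slow LLM call."""
    await asyncio.sleep(0)
    return f"summary of {diagnosis.filename}"


async def diagnose_files(
    filenames: List[str], include_llm: bool = False
) -> List[FileDiagnosis]:
    # Phase 1: always run the fast technical scan.
    results = [scan_file(name) for name in filenames]
    # Phase 2: attach LLM analysis only when the caller opts in.
    if include_llm:
        summaries = await asyncio.gather(*(analyze_with_llm(d) for d in results))
        for diagnosis, summary in zip(results, summaries):
            diagnosis.llm_analysis = summary
    return results
```

Keeping the LLM step behind a flag lets the UI render scan results immediately and fill in the LLM fields when the second response arrives.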
Use refs instead of state for phase tracking to avoid useEffect cleanup
when diagnosisResults updates from second API call. Add dataArrived state
to detect first data arrival without re-triggering on subsequent updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add new `services/llm` module with centralized LLM configuration
- LLMSettings resolves settings with DB > ENV > default priority
- LLMService provides get_chat_model(), complete(), and complete_structured()
- Pydantic schemas for type-safe structured LLM responses
- Refactor deep agent to use LLMService instead of inline OpenAI config
- Refactor llm_analysis_service to use complete_structured() instead of manual JSON parsing
- Deprecate old get_llm_provider() with migration guide

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
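
The DB > ENV > default priority mentioned for LLMSettings could look roughly like the sketch below. This is not the code from the PR: `resolve_setting`, `ResolvedSetting`, and the plain-mapping representation of DB values are assumptions made for illustration, and empty strings are treated as unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ResolvedSetting:
    value: str
    source: str  # "db", "env", or "default"


def resolve_setting(
    key: str,
    db_values: Mapping[str, str],
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> ResolvedSetting:
    """Return the first non-empty value, checking DB, then ENV, then the default."""
    env = os.environ if env is None else env
    if db_values.get(key):
        return ResolvedSetting(db_values[key], "db")
    if env.get(key):
        return ResolvedSetting(env[key], "env")
    return ResolvedSetting(default, "default")
```

Returning the source alongside the value makes it easier to log which layer a setting came from when debugging configuration.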
- Add POST /files/count-duplicates API endpoint
- Add count_cross_file_duplicates method using DuckDB UNION ALL + DISTINCT
- Skip calculation when total rows exceed 100,000 (returns skipped=true)
- Call API automatically when file schemas match during diagnosis
- Pass duplicateInfo to DiagnosisResultView (UI display in future PR)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
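
The UNION ALL + DISTINCT approach can be illustrated as follows. The PR uses DuckDB; this sketch swaps in the standard-library `sqlite3` module so it runs without extra dependencies, and the return shape (`skipped`, `total_rows`, `duplicate_rows`) is an assumption apart from the `skipped` flag named in the commit. Only the 100,000-row threshold comes from the commit message.

```python
import sqlite3
from typing import Sequence

MAX_ROWS = 100_000  # threshold stated in the commit message


def count_cross_file_duplicates(
    conn: sqlite3.Connection, tables: Sequence[str], max_rows: int = MAX_ROWS
) -> dict:
    """Count rows that would be dropped if the tables were merged and deduplicated.

    Table names are interpolated into SQL, so they must be trusted identifiers.
    All tables are assumed to share the same schema.
    """
    union_all = " UNION ALL ".join(f'SELECT * FROM "{t}"' for t in tables)
    total = conn.execute(f"SELECT COUNT(*) FROM ({union_all})").fetchone()[0]
    if total > max_rows:
        # Too large to deduplicate cheaply; report that the count was skipped.
        return {"skipped": True, "total_rows": total, "duplicate_rows": None}
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM ({union_all}))"
    ).fetchone()[0]
    # Note: this also counts duplicates that occur within a single table.
    return {"skipped": False, "total_rows": total, "duplicate_rows": total - distinct}
```

For two tables holding `(1, "x"), (2, "y")` and `(2, "y"), (3, "z")`, the union has four rows and three distinct rows, so one duplicate is reported.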
When multiple files with identical schemas are being merged, the LLM now
generates a unified suggested name and context description for the
combined dataset. This helps users understand what the merged result
will represent.

Changes:
- Add MergeContext and MergedAnalysis dataclasses for merge info
- Update diagnose_files_with_llm to accept merge_context parameter
- Extend LLM prompt to handle merge context and generate merged analysis
- Add merged_suggested_name and merged_context to BatchAnalysisSchema
- Update frontend to pass duplicate info as merge context to LLM
- Display merged dataset suggestion in DiagnosisResultView

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add StatusBadge component for ready/review status display
- Add DatasetCard component with editable dataset name
- Add AgentRecommendation component for merge suggestions
- Redesign DiagnosisResultView with new component composition
- Support custom dataset names during import (LLM suggested or user edited)
- Update header to Korean: "{N}개 파일 스캔 완료" ("{N} files scanned")
- Change import button text to "Create Datasets"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Increase Phase 1 step timing (600-1000ms processing, 500ms transition)
- Replace cycling message with stacking structure that accumulates
- Add 16 sequential LLM waiting messages (up from 10)
- Show column names instead of type counts (max 6 + N more)
- Remove unused getColumnTypeSummaryForFile helper

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add stats (rows, columns, issues) to DatasetCard component
- Remove FileDiagnosisCard expandable schema details
- Update AgentRecommendation to use Package icon and English text
- Change StatusBadge colors (emerald for ready, orange for review)
- Simplify DiagnosisResultView by removing expandable sections

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Connect toggleMerge and toggleRemoveDuplicates click handlers
- Add conditional styling based on checked state (dark gray when selected, light gray when unselected)
- Disable remove duplicates option when merge is not selected

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace inline-flex with flex for vertical stacking
- Use flex-col gap-2 for consistent spacing
- Add w-fit to constrain item width to content

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tendtoyj tendtoyj self-assigned this Jan 27, 2026
tendtoyj and others added 2 commits January 27, 2026 20:36
- Remove parse_llm_response tests (function removed in favor of structured output)
- Update TestAnalyzeBatchWithLLM to mock LLMService instead of provider
- Update TestAnalyzeDatasetsWithLLM to use BatchLLMAnalysisResult

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
DiagnosisError with "File not found" now returns 404 instead of 500.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
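
One way to express the 404-versus-500 mapping from the last commit, sketched without any web framework. The `DiagnosisError` class below is a stand-in for the project's own exception, and matching on the message text is an assumption based only on the commit message; the real handler may distinguish the cases differently.

```python
from http import HTTPStatus


class DiagnosisError(Exception):
    """Stand-in for the project's DiagnosisError; the real class may differ."""


def status_for_diagnosis_error(exc: DiagnosisError) -> HTTPStatus:
    """Map a diagnosis failure to an HTTP status code."""
    if "File not found" in str(exc):
        # A missing file is a client-side problem, not a server fault.
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.INTERNAL_SERVER_ERROR
```

A dedicated exception subclass for missing files would be more robust than string matching, if the codebase allows it.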
@FluxloopAdmin (Contributor) left a comment

✅ LGTM - Approved

The LLM-based dataset analysis feature is very well designed!

👍 Highlights

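  • LLMService architecture: LangChain + structured output ensures consistent responses
  • Two-phase diagnosis: fast technical analysis + optional LLM analysis
  • Merge Context: provides a unified analysis when merging files with identical schemas
  • DatasetAnalyzingView: the sequential message animation makes for excellent UX
  • Prompt engineering: clear guidelines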

💡 Future improvements (Optional)

  1. User feedback / retry option when the LLM fails
  2. Improve AgentRecommendation accessibility (use a real checkbox)
  3. Make LLM_BATCH_SIZE adjustable via configuration
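
For the third item, making LLM_BATCH_SIZE configurable might be as simple as reading it from the environment with a safe fallback. This is only a sketch: `get_llm_batch_size`, `batched`, and the default of 5 are hypothetical, since the PR does not state the current value or where the constant lives.

```python
import os
from typing import Iterator, List, Mapping, Optional, Sequence, TypeVar

T = TypeVar("T")
DEFAULT_LLM_BATCH_SIZE = 5  # hypothetical default; not taken from the PR


def get_llm_batch_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Read LLM_BATCH_SIZE, falling back to the default on missing or invalid values."""
    env = os.environ if env is None else env
    raw = env.get("LLM_BATCH_SIZE", "")
    try:
        size = int(raw)
    except ValueError:
        return DEFAULT_LLM_BATCH_SIZE
    return size if size > 0 else DEFAULT_LLM_BATCH_SIZE


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```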

It's a large feature addition, yet the structure is clean! 🚀

@FluxloopAdmin FluxloopAdmin merged commit d7f9c6a into main Jan 27, 2026
1 check passed
@tendtoyj tendtoyj deleted the feat/dataset-llm-scan branch January 27, 2026 11:48