feat(diagnosis): add LLM-powered dataset analysis with redesigned UI #4
Merged
…e diagnosis
- Add new Pydantic response models: EncodingInfoResponse, ParsingIntegrityResponse, NumericStatsResponse, CategoricalStatsResponse, DateStatsResponse, ColumnStatisticsResponse
- Extend FileDiagnosisResponse with encoding, parsing_integrity, column_statistics, sample_rows
- Update diagnose_files endpoint handler to map all new fields from dataclass to response
- Add _log_diagnosis_result() method for detailed diagnosis logging
- Add chardet dependency for encoding detection
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add DatasetAnalyzingView component with 9-step analysis flow
- Extend FileDiagnosis types with encoding, parsing_integrity, column_statistics
- Integrate analyzing step between preview and diagnose in AddDatasetModal
- Display per-file information for each analysis step (files, parsing, columns, quality, statistics)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add LLMAnalysisResult, PotentialItem, IssueItem dataclasses
- Create llm_analysis_service.py with batch processing support
- Add dataset_analysis_prompt.md for LLM prompts
- Update FileDiagnosis to include llm_analysis field
- Add diagnose_files_with_llm() async method
- Update API endpoint to async and include LLM response models
- Add OPENAI_API_KEY fallback in providers.py
- Add unit tests for LLM analysis and caching
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
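The batch processing mentioned in the commit above could look roughly like the sketch below. The helper name `iter_batches` and the default value of `LLM_BATCH_SIZE` are assumptions for illustration; the PR does not show the real implementation or the actual batch size.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical default; the PR does not state the real batch size.
LLM_BATCH_SIZE = 5


def iter_batches(items: Sequence[T], batch_size: int = LLM_BATCH_SIZE) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    files = [f"file_{i}.csv" for i in range(12)]
    for batch in iter_batches(files):
        # Each batch would be sent to the LLM in a single request.
        print(len(batch), list(batch))
```

Grouping files this way keeps each prompt bounded in size, at the cost of one LLM call per batch.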
- Add include_llm parameter to diagnose endpoint for optional LLM analysis
- Implement two-phase UI: fast diagnosis first, then LLM analysis
- Update DatasetAnalyzingView to handle phased step progression
- Add LLM analysis types to frontend API
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
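A minimal sketch of how an `include_llm` flag can separate the fast diagnosis from the optional LLM pass. Only `include_llm`, `FileDiagnosis`, and `llm_analysis` come from the commits; `run_fast_diagnosis`, `run_llm_analysis`, and the simplified fields are placeholders, not the project's actual code.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileDiagnosis:
    # Simplified stand-in; the real FileDiagnosis carries many more fields.
    path: str
    llm_analysis: Optional[str] = None


def run_fast_diagnosis(path: str) -> FileDiagnosis:
    # Placeholder for the technical checks (encoding, parsing, statistics).
    return FileDiagnosis(path=path)


async def run_llm_analysis(diagnosis: FileDiagnosis) -> str:
    # Placeholder for the LLM call; a real implementation would await a model.
    await asyncio.sleep(0)
    return f"summary for {diagnosis.path}"


async def diagnose_files(paths: List[str], include_llm: bool = False) -> List[FileDiagnosis]:
    # Phase 1 always runs and is cheap.
    results = [run_fast_diagnosis(p) for p in paths]
    # Phase 2 runs only when the caller asks for it.
    if include_llm:
        analyses = await asyncio.gather(*(run_llm_analysis(d) for d in results))
        for diagnosis, analysis in zip(results, analyses):
            diagnosis.llm_analysis = analysis
    return results
```

With this shape, a client can call once with `include_llm=False` to render results quickly, then call again with `include_llm=True` to fill in the LLM fields.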
Use refs instead of state for phase tracking so that the useEffect cleanup is not triggered when diagnosisResults updates after the second API call. Add a dataArrived state to detect the first data arrival without re-triggering on subsequent updates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add new `services/llm` module with centralized LLM configuration
- LLMSettings resolves settings with DB > ENV > default priority
- LLMService provides get_chat_model(), complete(), and complete_structured()
- Pydantic schemas for type-safe structured LLM responses
- Refactor deep agent to use LLMService instead of inline OpenAI config
- Refactor llm_analysis_service to use complete_structured() instead of manual JSON parsing
- Deprecate old get_llm_provider() with migration guide
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
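The DB > ENV > default priority described above can be sketched as a small lookup function. The function name `resolve_setting`, the key `LLM_MODEL`, and the convention that `None` means "unset" are assumptions; the real LLMSettings class may resolve values differently.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    key: str,
    db_values: Mapping[str, Optional[str]],
    env: Optional[Mapping[str, str]] = None,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first value found, checking DB, then environment, then default."""
    env = os.environ if env is None else env
    db_value = db_values.get(key)
    if db_value is not None:
        return db_value
    env_value = env.get(key)
    if env_value is not None:
        return env_value
    return default


if __name__ == "__main__":
    # "LLM_MODEL" is a hypothetical setting name used only for this example.
    print(resolve_setting("LLM_MODEL", {}, default="fallback-model"))
```

Passing `env` explicitly makes the priority order easy to unit test without touching the process environment.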
- Add POST /files/count-duplicates API endpoint
- Add count_cross_file_duplicates method using DuckDB UNION ALL + DISTINCT
- Skip calculation when total rows exceed 100,000 (returns skipped=true)
- Call API automatically when file schemas match during diagnosis
- Pass duplicateInfo to DiagnosisResultView (UI display in future PR)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
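The UNION ALL + DISTINCT idea can be illustrated as follows. To stay dependency-free, this sketch uses the standard-library `sqlite3` module in place of DuckDB, and it assumes each file has already been loaded into a table with the same schema. The return keys and the exact definition of "duplicate" (total rows minus distinct rows, which also counts duplicates inside a single file) are assumptions, not the PR's actual behavior.

```python
import sqlite3
from typing import Dict, List, Optional, Union

MAX_ROWS = 100_000  # threshold stated in the commit message


def count_cross_file_duplicates(
    conn: sqlite3.Connection,
    tables: List[str],
    max_rows: int = MAX_ROWS,
) -> Dict[str, Union[bool, int, None]]:
    # Table names are interpolated directly, so they must come from trusted code.
    union_all = " UNION ALL ".join(f'SELECT * FROM "{t}"' for t in tables)
    total = conn.execute(f"SELECT COUNT(*) FROM ({union_all})").fetchone()[0]
    if total > max_rows:
        # Too large to deduplicate cheaply; report that the check was skipped.
        return {"skipped": True, "total_rows": total, "duplicate_rows": None}
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM ({union_all}))"
    ).fetchone()[0]
    return {"skipped": False, "total_rows": total, "duplicate_rows": total - distinct}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for name in ("file_a", "file_b"):
        conn.execute(f'CREATE TABLE "{name}" (id INTEGER, label TEXT)')
    conn.executemany('INSERT INTO "file_a" VALUES (?, ?)', [(1, "x"), (2, "y")])
    conn.executemany('INSERT INTO "file_b" VALUES (?, ?)', [(2, "y"), (3, "z")])
    print(count_cross_file_duplicates(conn, ["file_a", "file_b"]))
```

Checking the total row count first means the more expensive DISTINCT query never runs on oversized inputs.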
When multiple files with identical schemas are being merged, the LLM now generates a unified suggested name and context description for the combined dataset. This helps users understand what the merged result will represent.
Changes:
- Add MergeContext and MergedAnalysis dataclasses for merge info
- Update diagnose_files_with_llm to accept merge_context parameter
- Extend LLM prompt to handle merge context and generate merged analysis
- Add merged_suggested_name and merged_context to BatchAnalysisSchema
- Update frontend to pass duplicate info as merge context to LLM
- Display merged dataset suggestion in DiagnosisResultView
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add StatusBadge component for ready/review status display
- Add DatasetCard component with editable dataset name
- Add AgentRecommendation component for merge suggestions
- Redesign DiagnosisResultView with new component composition
- Support custom dataset names during import (LLM suggested or user edited)
- Update header to Korean: "{N}개 파일 스캔 완료" ("{N} files scanned")
- Change import button text to "Create Datasets"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Increase Phase 1 step timing (600-1000ms processing, 500ms transition)
- Replace cycling message with stacking structure that accumulates
- Add 16 sequential LLM waiting messages (up from 10)
- Show column names instead of type counts (max 6 + N more)
- Remove unused getColumnTypeSummaryForFile helper
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add stats (rows, columns, issues) to DatasetCard component
- Remove FileDiagnosisCard expandable schema details
- Update AgentRecommendation to use Package icon and English text
- Change StatusBadge colors (emerald for ready, orange for review)
- Simplify DiagnosisResultView by removing expandable sections
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Connect toggleMerge and toggleRemoveDuplicates click handlers
- Add conditional styling based on checked state (dark gray when selected, light gray when unselected)
- Disable remove duplicates option when merge is not selected
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace inline-flex with flex for vertical stacking
- Use flex-col gap-2 for consistent spacing
- Add w-fit to constrain item width to content
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove parse_llm_response tests (function removed in favor of structured output)
- Update TestAnalyzeBatchWithLLM to mock LLMService instead of provider
- Update TestAnalyzeDatasetsWithLLM to use BatchLLMAnalysisResult
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
DiagnosisError with "File not found" now returns 404 instead of 500.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
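One way to express the status mapping from the commit above is shown below. The `DiagnosisError` class here is a stand-in, and matching on the message text is an assumption; the actual handler may use an error code or a dedicated exception subclass instead.

```python
class DiagnosisError(Exception):
    """Stand-in for the project's DiagnosisError."""


def status_code_for(error: DiagnosisError) -> int:
    # Assumption: a missing file is recognized by its message text.
    if "file not found" in str(error).lower():
        return 404  # the client referenced a file that does not exist
    return 500  # any other diagnosis failure is treated as a server error


if __name__ == "__main__":
    print(status_code_for(DiagnosisError("File not found: data.csv")))
    print(status_code_for(DiagnosisError("Failed to parse CSV")))
```

Returning 404 lets clients distinguish a bad file reference from a genuine server-side failure.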
FluxloopAdmin approved these changes on Jan 27, 2026
✅ LGTM - Approved
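The LLM-based dataset analysis feature is very well designed!
👍 Highlights
- LLMService architecture: LangChain plus structured output ensures consistent responses
- Two-phase diagnosis: fast technical analysis followed by optional LLM analysis
- Merge Context: provides a unified analysis when files with identical schemas are merged
- DatasetAnalyzingView: the sequential message animation makes for excellent UX
- Prompt engineering: clear guidelines
💡 Future improvements (Optional)
- User feedback and a retry option when the LLM call fails
- Improve AgentRecommendation accessibility (use a real checkbox)
- Make LLM_BATCH_SIZE adjustable through configuration
The structure is clean even though this is a large feature addition! 🚀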
Summary
Adds LLM-based dataset analysis to the file diagnosis feature and fully redesigns the scan results UI.
Changes
Backend
- … (`llm_analysis_service.py`, `LLMService`)
- Add the `chardet` package (for file encoding detection)
Frontend
- … (`DatasetCard.tsx`)
- … (`DatasetAnalyzingView.tsx`)
Test plan
🤖 Generated with Claude Code