Japanese Corpus Syntactic Analysis Agent

Morphological Analysis → Syntactic Dependency Parsing → Metrics Computation → Visualization

🎯 Overview

An AI agent that runs a fully automated linguistic analysis pipeline on Japanese text: corpus cleaning, morphological analysis, syntactic dependency parsing, metrics computation, and visualization. The architecture is hybrid: an LLM orchestrates the analysis and interprets the results, while MeCab + UniDic and spaCy perform the precise linguistic processing.

🔑 Key Features

| # | Feature | Description |
|---|---------|-------------|
| 1 | Corpus Cleaning | Unicode normalization, ruby notation removal, non-body line detection |
| 2 | Morphological Analysis | High-precision analysis with MeCab + UniDic v3.1.0 (full version, 501MB) |
| 3 | Dependency Parsing | 13-field dependency extraction with the spaCy GSD UD model |
| 4 | Metrics Computation | Four-dimensional metric system: morphological, syntactic complexity, dependency distance, cross-text comparison |
| 5 | Visualization | Dependency tree diagrams, radar charts, comparison bar charts (PNG/PDF) |
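The corpus cleaning step in row 1 can be sketched roughly as follows. This is a minimal illustration, not the project's `corpus_cleaner.py`: it assumes Aozora Bunko-style ruby markup (`｜base《reading》`, `［＃note］`) and a simple separator-line heuristic for non-body lines, and the function names are hypothetical.

```python
import re
import unicodedata

# Assumed Aozora Bunko-style markup; the project's actual rules may differ.
RUBY_WITH_MARKER = re.compile(r"｜([^｜《》]*)《[^》]*》")  # ｜base《reading》
RUBY_READING = re.compile(r"《[^》]*》")                    # base《reading》
EDITOR_NOTE = re.compile(r"［＃[^］]*］")                   # ［＃editor note］
SEPARATOR = re.compile(r"[-=_]{5,}")                        # rule lines such as "-------"


def clean_line(line: str) -> str:
    """Strip editor notes and ruby markup, then apply NFKC normalization."""
    line = EDITOR_NOTE.sub("", line)
    line = RUBY_WITH_MARKER.sub(r"\1", line)  # keep the base text, drop the marker
    line = RUBY_READING.sub("", line)
    return unicodedata.normalize("NFKC", line).strip()


def is_body_line(line: str) -> bool:
    """Hypothetical heuristic: blank lines and separator rules are not body text."""
    stripped = line.strip()
    return bool(stripped) and SEPARATOR.fullmatch(stripped) is None


def clean_corpus(text: str) -> list[str]:
    """Return the cleaned body lines of a raw text."""
    cleaned = (clean_line(raw_line) for raw_line in text.splitlines())
    return [line for line in cleaned if is_body_line(line)]


if __name__ == "__main__":
    raw = "｜先生《せんせい》と呼んでいた。\n-------\nＡＢＣ［＃注記］"
    print(clean_corpus(raw))
```

Ruby markup is removed before normalization so that NFKC cannot alter the fullwidth marker characters the patterns rely on.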

🏗 Architecture

┌──────────────────────────────────────────────────┐
│                 LLM (swappable)                  │
│   Orchestrates tools, interprets results         │
└──────────┬───────────────────────────────┬───────┘
           │                               │
    ┌──────▼──────┐                 ┌──────▼──────┐
    │  MeCab +    │                 │   spaCy     │
    │  UniDic     │                 │  GSD UD     │
    │ Morphology  │                 │ Dependency  │
    │  analysis   │                 │  parsing    │
    └──────┬──────┘                 └──────┬──────┘
           │                               │
    ┌──────▼───────────────────────────────▼──────┐
    │     Metrics Computation & Visualization     │
    │  ┌────────────┬─────────────┬────────────┐  │
    │  │ Morphology │ Syntactic   │ Dependency │  │
    │  │  Metrics   │ Complexity  │  Distance  │  │
    │  └────────────┴─────────────┴────────────┘  │
    └─────────────────────────────────────────────┘

The LLM is a swappable component: set an environment variable to switch models. The linguistic analysis core (MeCab, spaCy, metrics computation) runs independently of the LLM.

📊 Validation Against Reference Data

Validated on Chapter 2 of Natsume Sōseki's Kokoro:

| Metric | Ours | Reference | Diff |
|--------|------|-----------|------|
| Total morphemes | 951 | 955 | -4 (0.4%) |
| Unique morphemes | 303 | 303 | 0 ✅ |
| Verbs (self-standing) | 86 | 86 | 0 ✅ |
| Attributive words (連体詞) | 11 | 11 | 0 ✅ |
| Proper nouns | 6 | 6 | 0 ✅ |

The unique morpheme count matches the reference exactly when morphemes are deduplicated on the first 12 digits of the lexeme ID (語彙素ID). POS classification also follows the reference (official) standard: only self-standing verbs are counted as verbs, and verbs tagged 非自立可能 (able to be non-self-standing) are counted under その他 (other).
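The deduplication rule and the relative difference in the table can be illustrated with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, IDs are treated as digit strings, and the sample IDs are made up rather than taken from UniDic.

```python
def count_unique_morphemes(lexeme_ids: list[str], prefix_len: int = 12) -> int:
    """Count distinct morphemes, treating IDs that share the first
    `prefix_len` digits as the same lexeme."""
    return len({lexeme_id[:prefix_len] for lexeme_id in lexeme_ids})


def relative_diff(ours: int, reference: int) -> float:
    """Signed difference relative to the reference value."""
    return (ours - reference) / reference


if __name__ == "__main__":
    # Made-up IDs: the first two share their 12-digit prefix.
    sample_ids = ["123456789012345", "123456789012999", "987654321098765"]
    print(count_unique_morphemes(sample_ids))
    # Total morphemes from the table: 951 (ours) vs 955 (reference).
    print(f"{relative_diff(951, 955):.1%}")
```

Deduplicating on the raw ID instead of the prefix would count the first two sample entries separately, which is the kind of discrepancy the prefix rule is meant to avoid.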

📈 Metrics Overview

Morphological Metrics

Total morphemes (tokens), unique morphemes (types), TTR (type-token ratio, a measure of lexical diversity)
POS distribution (UniDic classification + official-style classification)
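As a rough sketch of how these morphological metrics fit together, the snippet below computes token count, type count, TTR, and a POS distribution from `(lemma, pos)` pairs. The function name and input shape are assumptions for illustration; the project's own module under `skills/morphology_metrics/` may be organized differently.

```python
from collections import Counter


def morphological_metrics(tokens: list[tuple[str, str]]) -> dict:
    """Compute basic metrics from (lemma, pos) pairs."""
    total = len(tokens)
    unique = len({lemma for lemma, _ in tokens})
    pos_counts = Counter(pos for _, pos in tokens)
    return {
        "total": total,
        "unique": unique,
        # TTR = types / tokens; defined as 0.0 for empty input
        "ttr": unique / total if total else 0.0,
        "pos_distribution": dict(pos_counts),
    }


if __name__ == "__main__":
    sample = [
        ("私", "代名詞"), ("は", "助詞"), ("本", "名詞"),
        ("を", "助詞"), ("読む", "動詞"), ("私", "代名詞"),
    ]
    print(morphological_metrics(sample))
```

With the validation figures reported for Kokoro (303 unique out of 951 total morphemes), the same formula gives a TTR of about 0.319.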

Syntactic Complexity Metrics

MLS (mean sentence length), MLT (mean clause length)
DC (clause density), C/T, T/S

Dependency Distance Metrics

MDD (mean dependency distance), MDD_std
NDD (normalized dependency distance)
HDD (hierarchical dependency distance), PDD (projection degree)
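To make the first of these concrete, here is a minimal sketch of MDD and its standard deviation for one sentence, using the common definition of dependency distance as the absolute difference between a token's position and its head's position, with the root excluded. The function names are hypothetical, and using the population standard deviation for MDD_std is an assumption; the project's exact definitions (including NDD, HDD, and PDD) live in docs/spec.md and are not reproduced here.

```python
from statistics import mean, pstdev


def dependency_distances(heads: list[int]) -> list[int]:
    """heads[i] is the 1-based head position of token i+1 (CoNLL-U style);
    0 marks the root, which has no dependency distance."""
    return [abs(pos - head) for pos, head in enumerate(heads, start=1) if head != 0]


def mdd(heads: list[int]) -> float:
    """Mean dependency distance of one sentence."""
    distances = dependency_distances(heads)
    return float(mean(distances)) if distances else 0.0


def mdd_std(heads: list[int]) -> float:
    """Population standard deviation of the dependency distances (assumed)."""
    distances = dependency_distances(heads)
    return pstdev(distances) if distances else 0.0


if __name__ == "__main__":
    # 私 は 本 を 読む, with illustrative UD-style heads:
    # 私→読む, は→私, 本→読む, を→本, 読む = root
    heads = [5, 1, 5, 3, 0]
    print(dependency_distances(heads), mdd(heads), mdd_std(heads))
```

For a whole text, one option is to pool the distances of all sentences before averaging; another is to average the per-sentence MDD values. Which of the two the project uses is not stated here.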

Cross-text Comparison Metrics

Cross-text comparison across all metrics
Radar chart + bar chart visualization

🤖 Supported LLM Models

Set the AGENT_MODEL environment variable to switch models dynamically; if it is unset, the default from the config file is used.

| Model ID | Description |
|----------|-------------|
| `doubao-seed-2-0-pro-260215` | Flagship, complex reasoning |
| `doubao-seed-2-0-lite-260215` | Balanced performance and cost |
| `doubao-seed-2-0-mini-260215` | Lightweight and fast, 256k context |
| `doubao-seed-1-6-251015` | General-purpose, default |
| `deepseek-r1-250528` | Full 671B reasoning model |
| `kimi-k2-5-260127` | Open-source SoTA |
| `glm-5-0-260211` | Agentic engineering flagship |
| `qwen-3-5-plus-260215` | Hybrid MoE architecture |

The LLM backend can work with any OpenAI-compatible API. The list above shows the models available on the Coze platform.

🚀 Quick Start

Prerequisites

Python 3.12+
uv package manager
An OpenAI-compatible LLM API key

Installation

# Clone the repository
git clone https://github.com/Albertaworlds/Japanese-Corpus-Syntactic-Analysis-Agent.git
cd Japanese-Corpus-Syntactic-Analysis-Agent

# Install dependencies
uv sync

# Download the spaCy Japanese model
python -m spacy download ja_core_news_ud

UniDic Dictionary Setup

UniDic v3.1.0 (full version, ~501MB) is required. Obtain it using one of the following methods:

Method A: From the pip package (recommended)

uv add unidic
python -m unidic download    # ~501MB, downloads into .venv
Method B: Manual download into the project directory

# Create the dictionary directory
mkdir -p assets/unidic_dicdir

# Download UniDic 3.1.0, extract it, and place the contents in assets/unidic_dicdir/
# Download source: https://unidic.ninjal.ac.jp/
# Required files: sys.dic, matrix.bin, char.bin, unk.dic, dicrc

Method A is the simplest. Method B suits environments where .venv gets rebuilt, or cases where you want to manage the dictionary inside the project. The code looks in assets/unidic_dicdir/ first, then falls back to the unidic package.
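The lookup order described above might look something like the following sketch. It is not the project's `_mecab_helper.py`: the function name is hypothetical, and treating the local directory as valid only when all five required files are present is an assumption. `unidic.DICDIR` is the dictionary path exposed by the `unidic` pip package.

```python
from pathlib import Path
from typing import Optional

# Files listed as required for a manually installed dictionary (Method B).
REQUIRED_FILES = ("sys.dic", "matrix.bin", "char.bin", "unk.dic", "dicrc")


def find_unidic_dir(project_root: Path) -> Optional[Path]:
    """Prefer assets/unidic_dicdir/ when it looks complete; otherwise
    fall back to the directory provided by the unidic package."""
    local = project_root / "assets" / "unidic_dicdir"
    if all((local / name).is_file() for name in REQUIRED_FILES):
        return local
    try:
        import unidic  # optional dependency (Method A)
    except ImportError:
        return None  # neither source is available
    return Path(unidic.DICDIR)


if __name__ == "__main__":
    print(find_unidic_dir(Path.cwd()))
```

Checking for the individual files rather than just the directory avoids picking up an empty `assets/unidic_dicdir/` created by `mkdir -p` before the dictionary has been extracted.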

Environment Variables

# Required: LLM API key and endpoint
export COZE_WORKLOAD_IDENTITY_API_KEY="your-api-key"
export COZE_INTEGRATION_MODEL_BASE_URL="https://your-model-endpoint"

# Optional: dynamic model switching
export AGENT_MODEL="doubao-seed-1-6-251015"

⚠️ Security: Never commit API keys to Git. Keep them in a .env file and make sure .env is listed in .gitignore.

Run

# Start the HTTP server
bash scripts/http_run.sh -p 5000

# Local conversation mode
bash scripts/local_run.sh -m flow

📁 Project Structure

.
├── config/
│   └── agent_llm_config.json       # LLM configuration
├── docs/
│   └── spec.md                     # Technical specification
├── scripts/
│   ├── http_run.sh                 # HTTP server launch
│   ├── local_run.sh                # Local execution
│   └── setup.sh                    # Initial setup
├── src/
│   ├── agents/
│   │   └── agent.py                # Agent core (LangGraph)
│   ├── storage/
│   │   └── memory/                 # Checkpointer
│   ├── tools/
│   │   ├── _mecab_helper.py        # MeCab + UniDic morphological core
│   │   ├── corpus_cleaner.py       # Corpus cleaning
│   │   ├── morphology_analyzer.py  # Morphological analysis
│   │   ├── japanese_parser.py      # Dependency parsing
│   │   ├── dependency_visualizer.py    # Dependency tree visualization
│   │   ├── comparison_visualizer.py    # Comparison visualization
│   │   ├── full_analysis_pipeline.py   # Full pipeline
│   │   └── skills/                 # Metrics modules
│   │       ├── morphology_metrics/
│   │       ├── syntactic_complexity_metrics/
│   │       ├── dependency_distance_metrics/
│   │       └── comparison_metrics/
│   ├── utils/                      # Shared utilities
│   └── main.py                     # Entry point
└── assets/
    ├── unidic_dicdir/              # UniDic dictionary (setup required)
    └── config/                     # Tool configurations

⚙️ Configuration

`config/agent_llm_config.json`

{
  "config": {
    "model": "doubao-seed-1-6-251015",
    "temperature": 0.3,
    "top_p": 0.9,
    "max_completion_tokens": 10000,
    "timeout": 600,
    "thinking": "disabled"
  },
  "sp": "あなたは日本語コーパス構文分析の専門家です...",
  "tools": ["analysis_pipeline", "clean_japanese_corpus", ...]
}

Model priority:

The AGENT_MODEL environment variable takes precedence over the model field in config/agent_llm_config.json.
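The precedence rule can be sketched in a few lines. The function name is hypothetical, and treating an empty AGENT_MODEL as unset is an assumption; the JSON layout follows the config example shown in this section.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_model(config_path: Path, env: Optional[Mapping[str, str]] = None) -> str:
    """Return AGENT_MODEL if set and non-empty, else the config file's model."""
    env = os.environ if env is None else env
    override = env.get("AGENT_MODEL")
    if override:  # assumption: an empty string counts as "not set"
        return override
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)["config"]["model"]


if __name__ == "__main__":
    print(resolve_model(Path("config/agent_llm_config.json")))
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.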

📖 Technical Specification

See docs/spec.md for details:

13-field dependency annotation format
Computation definitions for the four-dimensional metric system
POS mapping between the MeCab + UniDic classification and the official standard
Skill module interface specification

🛠 Development

Testing

bash scripts/local_run.sh -m flow

Adding Dependencies

uv add <package-name>

⚠️ Do not use pip install. This project is managed with uv.

📄 License

MIT License

🙏 Acknowledgements

UniDic: morphological analysis dictionary from NINJAL (National Institute for Japanese Language and Linguistics)
MeCab: morphological analyzer
spaCy: dependency parser
LangGraph: agent framework
