Merge branch 'release/0.1.0'

Sunwood-ai-labs · Jun 8, 2024 · ccc9c1c
2 parents 8e3baee + 4cc6006
commit ccc9c1c
Showing 15 changed files with 553 additions and 56 deletions.
diff --git a/.SourceSageignore b/.SourceSageignore
@@ -0,0 +1,38 @@
+.git
+__pycache__
+LICENSE
+output.md
+assets
+Style-Bert-VITS2
+output
+streamlit
+SourceSage.md
+data
+.gitignore
+.SourceSageignore
+*.png
+Changelog
+SourceSageAssets
+SourceSageAssetsDemo
+__pycache__
+.pyc
+**/__pycache__/**
+modules\__pycache__
+.svg
+sourcesage.egg-info
+.pytest_cache
+dist
+build
+.env
+example
+
+.gaiah.md
+.Gaiah.md
+tmp.md
+tmp2.md
+.SourceSageAssets
+tests
+template
+aira.egg-info
+aira.Gaiah.md
+README_template.md
diff --git a/.aira/config.dev.yml b/.aira/config.dev.yml
@@ -0,0 +1,69 @@
+aira:
+  gaiah:  # Common settings
+    run: true
+    repo:
+      repo_name: "PEGASUS"
+      description: "Evolutionary Merge Experiment"
+      private: false
+    local:
+      repo_dir: "C:/Prj/PEGASUS"
+      no_initial_commit: false
+    commit:
+      commit_msg_path: ".Gaiah.md"
+      branch_name: null
+
+    dev:  # Development settings (override as needed)
+      repo:
+        create_repo: false
+      local:
+        init_repo: false
+      commit:
+        process_commits: true
+
+    init:  # Initialization settings (override as needed)
+      repo:
+        create_repo: true
+      local:
+        init_repo: true
+      commit:
+        process_commits: false
+
+  llm:
+    model: "gemini/gemini-1.5-pro-latest"  # LLM model to use
+
+  repository_summary_output_dir: .aira  # Output directory for the repository summary
+  readme_prompt_template_path: .aira/readme_prompt_template.txt  # Path to the prompt template for README generation
+
+  harmon_ai:
+    run: true
+    environment:
+      repo_name: "PEGASUS"
+      owner_name: "Sunwood-ai-labs"
+      package_name: "PEGASUS"
+      icon_url: "https://huggingface.co/datasets/MakiAi/IconAssets/resolve/main/PEGASUS.jpeg"
+      title: "PEGASUS"
+      subtitle: "～ Evolutionary Merge Experiment ～"
+      website_url: "https://hamaruki.com/"
+      github_url: "https://github.com/Sunwood-ai-labs"
+      twitter_url: "https://x.com/hAru_mAki_ch"
+      blog_url: "https://hamaruki.com/"
+
+    product:
+      important_message_file: "important_template.md"
+      sections_content_file: "sections_template.md"
+      output_file: "README_template.md"
+      cicd_file_path: "publish-to-pypi.yml"
+      cicd_main_path: "publish-to-pypi.yml"
+      github_cicd_dir: ".github/workflows"
+
+    llm_product:
+      sections_content_file: "sections_template_llm.md"
+
+    development:
+      output_dir: "C:/Prj/PEGASUS/.harmon_ai"
+
+    main:
+      main_dir: "C:/Prj/PEGASUS/"
+      replace_readme: true
+
+    instructions_prompt: .aira/instructions.md
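The `dev` and `init` blocks in the config above are commented as overrides of the common `gaiah` settings. This commit does not show how AIRA applies them, so the following is only one plausible reading: a recursive dictionary merge. The helper `deep_merge` is hypothetical, and the example uses plain dicts instead of parsed YAML so that it runs without extra dependencies.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` values replace `base` values.

    Nested dicts are merged key by key; any other value is replaced outright.
    Hypothetical helper, not taken from AIRA.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A subset of the common gaiah settings from config.dev.yml
common = {
    "repo": {"repo_name": "PEGASUS", "private": False},
    "local": {"repo_dir": "C:/Prj/PEGASUS", "no_initial_commit": False},
    "commit": {"commit_msg_path": ".Gaiah.md", "branch_name": None},
}

# The dev overrides from the same file
dev = {
    "repo": {"create_repo": False},
    "local": {"init_repo": False},
    "commit": {"process_commits": True},
}

dev_settings = deep_merge(common, dev)
print(dev_settings["repo"])
```

Under this reading, the `init` block would be merged the same way, yielding `create_repo: true` and `process_commits: false` on top of the shared values.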
diff --git a/.gitignore b/.gitignore
@@ -164,4 +164,7 @@ tmp.md
 tmp2.md
 .Prothiel.md
 .Gaiah.md
-.SourceSageAssets
+.SourceSageAssets
+.aira/aira.Gaiah.md
+.harmon_ai/README_template.md
+output
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 <p align="center">
 <img src="https://huggingface.co/datasets/MakiAi/IconAssets/resolve/main/PEGASUS.jpeg" width="100%">
 <br>
-<h1 align="center">PEGASUS</h1>
+<h1 align="center">P.E.G.A.S.U.S</h1>
 <h2 align="center">
   ～ Parsing Extracting Generating Automated Scraping Utility System ～
 <br>
@@ -34,84 +34,79 @@
 >[!IMPORTANT]
 >Nearly 90% of this repository's release notes, README, and commit messages are generated with [AIRA](https://github.com/Sunwood-ai-labs/AIRA), [SourceSage](https://github.com/Sunwood-ai-labs/SourceSage), [Gaiah](https://github.com/Sunwood-ai-labs/Gaiah), and [HarmonAI_II](https://github.com/Sunwood-ai-labs/HarmonAI_II), which use [claude.ai](https://claude.ai/) and [ChatGPT4](https://chatgpt.com/).
 
-## 🌟 Introduction
 
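-**Pegasus** is a powerful, flexible Python package that recursively crawls a website and converts its content into cleanly formatted Markdown documents. Starting from a given URL, it follows links to discover related pages and converts their HTML content into structured Markdown files. It can be run from the command-line interface (CLI) or used directly from a Python script.
+pegasus is a powerful, flexible Python package that recursively crawls a website and converts its content to Markdown. Starting from a given URL, it follows links to discover related pages and converts their HTML content into clean Markdown documents. It can be run from the command-line interface (CLI) or used directly from a Python script.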
 
-## 🎥 Demo
+## Installation
 
-*A demo video is in preparation.*
+Install pegasus with pip.
 
-## 🚀 Getting Started
-
-This repository contains the configuration files for running Pegasus easily with Docker Compose.
-
-### Prerequisites
-
-* Docker
-* Docker Compose
-
-### How to Run
-
-1. Clone the repository.
-
-```bash
-git clone https://github.com/[your-username]/pegasus-docker-compose.git
+```shell
+pip install pegasus
 ```
 
-2. Change into the directory.
+## Usage
 
-```bash
-cd pegasus-docker-compose
-```
-
-3. Edit the `.env` file and set `TARGET_URL` to the URL of the website you want to crawl.
+### From the command line
 
-4. Start Docker Compose.
+To use pegasus from the command line, run a command like the following.
 
-```bash
-docker-compose up -d
+```shell
+pegasus https://example.com/start-page output_directory --exclude-selectors header footer nav --include-domain example.com --exclude-keywords login --output-extension txt
+pegasus https://docs.eraser.io/docs/what-is-eraser output/eraser_docs --exclude-selectors header footer nav aside .sidebar .header .footer .navigation .breadcrumbs --include-domain docs.eraser.io --exclude-keywords login --output-extension .txt
 ```
 
-5. When the process finishes, the Markdown files are written to the `output` directory.
-
-### Options
+- `https://example.com/start-page`: The base URL where crawling starts.
+- `output_directory`: The directory where the Markdown files are saved.
+- `--exclude-selectors`: CSS selectors to exclude, separated by spaces (optional).
+- `--include-domain`: Restricts crawling to the given domain (optional).
+- `--exclude-keywords`: Keywords, separated by spaces; pages whose URL contains any of them are skipped (optional).
 
-You can customize Pegasus's behavior by setting the following environment variables in the `.env` file.
+### From a Python script
 
-* `TARGET_URL`: URL of the website to crawl (required)
-* `OUTPUT_DIRECTORY`: Directory where the Markdown files are written (default: `./output`)
-* `DEPTH`: Crawl depth (default: `-1`, unlimited)
-* `LOG_LEVEL`: Log level (default: `INFO`)
+To use pegasus from a Python script, write code like the following.
 
-### Example
+```python
+from pegasus.pegasus import Pegasus
 
-Example that crawls `https://www.example.com` and writes the Markdown files to the `./my-output` directory:
-
-```
-TARGET_URL=https://www.example.com
-OUTPUT_DIRECTORY=./my-output
+pegasus = Pegasus(
+    base_url="https://example.com/start-page",
+    output_dir="output_directory",
+    exclude_selectors=['header', 'footer', 'nav'],
+    include_domain="example.com",
+    exclude_keywords=["login"]
+)
+pegasus.run()
 ```
 
-### Notes
+- `base_url`: The base URL where crawling starts.
+- `output_dir`: The directory where the Markdown files are saved.
+- `exclude_selectors`: A list of CSS selectors to exclude (optional).
+- `include_domain`: Restricts crawling to the given domain (optional).
+- `exclude_keywords`: A list of keywords; pages whose URL contains any of them are skipped (optional).
 
-* Depending on a website's structure and content, Pegasus may not produce the expected results.
-* When crawling a large website, watch the time and resource usage.
-* Check the website's terms of service before crawling.
+## Features
 
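-## 📝 Updates
+- Starts from a given URL and explores the website by following links recursively.
+- Converts HTML content into cleanly formatted Markdown.
+- Flexible configuration options let you customize the crawling and conversion process.
+- Can exclude unwanted elements such as headers, footers, and navigation.
+- Can restrict crawling to a specific domain.
+- Can exclude URLs that contain specific keywords.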
 
-*See the CHANGELOG.md file for the latest information.*
+## Notes
 
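-## 🤝 Contributing
+- Use pegasus appropriately and in accordance with each website's terms of service.
+- Add a suitable delay between requests so that you do not send an excessive number of them.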
 
-*Contributions are very welcome!*
+## License
 
-## 📄 License
+This project is released under the MIT License. See the [LICENSE](LICENSE) file for details.
 
-*This project is licensed under the [license name] license.*
+## Contributing
 
-## 🙏 Acknowledgements
+Pull requests and suggestions for improvement are very welcome. If you have a bug report or feature request, please open an issue.
 
-*Thanks to everyone who has contributed to the development of Pegasus.*
+---
 
+With pegasus, you can recursively explore a website and convert its content into clean Markdown documents. Use it for documentation automation, content management, data analysis, and more!
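The README describes `include_domain` and `exclude_keywords` only by their effect. As a rough illustration of how such URL filtering could work, here is a minimal sketch using the standard library. The function name `should_visit` is hypothetical and not part of the pegasus package, and treating subdomains of `include_domain` as in scope is an assumption; the package may behave differently.

```python
from urllib.parse import urlparse


def should_visit(url, include_domain=None, exclude_keywords=()):
    """Decide whether a crawler should fetch `url`.

    Hypothetical helper: mirrors the README's option names, but the
    matching rules below are assumptions, not the package's actual logic.
    """
    parsed = urlparse(url)
    # Skip mailto:, javascript:, and other non-web links.
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    if include_domain:
        # Compare hostnames rather than searching the whole URL, so that
        # "example.com.evil.test" does not pass as "example.com".
        in_domain = host == include_domain or host.endswith("." + include_domain)
        if not in_domain:
            return False
    # Reject any URL that contains one of the excluded keywords.
    return not any(keyword in url for keyword in exclude_keywords)


print(should_visit("https://example.com/docs", "example.com", ["login"]))
print(should_visit("https://example.com/login", "example.com", ["login"]))
```

Comparing the parsed hostname is stricter than a substring test on the full URL, which would also accept URLs that merely mention the domain in a path or query string.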
diff --git a/demo.py b/demo.py
@@ -0,0 +1,10 @@
+from pegasus.pegasus import Pegasus
+
+pegasus = Pegasus(
+    base_url="https://docs.eraser.io/docs/what-is-eraser",
+    output_dir="eraser_docs",
+    exclude_selectors=['header', 'footer', 'nav', 'aside', '.sidebar', '.header', '.footer', '.navigation', '.breadcrumbs'],
+    include_domain="docs.eraser.io",
+    exclude_keywords=["login"]
+)
+pegasus.run()
diff --git a/example/example01.py b/example/example01.py
@@ -0,0 +1,28 @@
+import requests
+import html2text
+
+def download_and_convert(url, output_file):
+    try:
+        # Download the web page from the URL
+        response = requests.get(url)
+        response.raise_for_status()
+
+        # Convert the HTML to Markdown
+        h = html2text.HTML2Text()
+        h.ignore_links = True
+        markdown_content = h.handle(response.text)
+
+        # Save the Markdown to a file
+        with open(output_file, 'w', encoding='utf-8') as file:
+            file.write(markdown_content)
+
+        print(f"Successfully converted {url} to {output_file}")
+    except requests.exceptions.RequestException as e:
+        print(f"Error downloading {url}: {e}")
+    except IOError as e:
+        print(f"Error writing to {output_file}: {e}")
+
+# Usage example
+url = "https://docs.eraser.io/docs/what-is-eraser"
+output_file = "example.md"
+download_and_convert(url, output_file)
diff --git a/example/example02.py b/example/example02.py
@@ -0,0 +1,50 @@
+import requests
+import html2text
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin, urlparse
+
+def download_and_convert(url, output_dir, visited_urls):
+    if url in visited_urls:
+        return
+    visited_urls.add(url)
+
+    try:
+        # Download the web page from the URL
+        response = requests.get(url)
+        response.raise_for_status()
+
+        # Convert the HTML to Markdown
+        h = html2text.HTML2Text()
+        h.ignore_links = True
+        markdown_content = h.handle(response.text)
+
+        # Save the Markdown to a file
+        parsed_url = urlparse(url)
+        output_file = f"{output_dir}/{parsed_url.path.replace('/', '_')}.md"
+        with open(output_file, 'w', encoding='utf-8') as file:
+            file.write(markdown_content)
+
+        print(f"Successfully converted {url} to {output_file}")
+
+        # Follow the links found on the page
+        soup = BeautifulSoup(response.text, 'html.parser')
+        for link in soup.find_all('a'):
+            href = link.get('href')
+            if href:
+                absolute_url = urljoin(url, href)
+                if "docs.eraser.io" in absolute_url:
+                    # Only follow URLs that contain docs.eraser.io
+                    # Strip the fragment part of the URL
+                    absolute_url = absolute_url.split('#')[0]
+                    download_and_convert(absolute_url, output_dir, visited_urls)
+
+    except requests.exceptions.RequestException as e:
+        print(f"Error downloading {url}: {e}")
+    except IOError as e:
+        print(f"Error writing to {output_file}: {e}")
+
+# Usage example
+base_url = "https://docs.eraser.io/docs/what-is-eraser"
+output_dir = "eraser_docs"
+visited_urls = set()
+download_and_convert(base_url, output_dir, visited_urls)
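`example02.py` crawls by calling itself once per link, so a long chain of pages could reach Python's recursion limit. As an alternative, here is a sketch of the same traversal written with an explicit queue. It is not the code from this commit: the `crawl` function, its `get_links` callback, and the `max_pages` cap are all hypothetical, and the page fetcher is passed in so the example runs against an in-memory stand-in for a site instead of the network.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl that returns URLs in the order they were visited.

    `get_links(url)` must return the raw href values found on that page.
    Hypothetical sketch; fetching and Markdown conversion are left out.
    """
    start_url = urldefrag(start_url).url
    visited = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop the #fragment before the
            # visited check, so /a and /a#top count as the same page.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)
    return order


# In-memory stand-in for a small site (made-up URLs)
site = {
    "https://example.com/a": ["/b", "/a#top", "c"],
    "https://example.com/b": ["/a", "/c#section"],
    "https://example.com/c": [],
}

pages = crawl("https://example.com/a", lambda url: site.get(url, []))
print(pages)
```

In a real crawler, `get_links` would wrap the `requests` and `BeautifulSoup` calls from `example02.py`, and a domain check would go inside the loop before a URL is queued.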