Release 🐎PEGASUS v0.2.0 Release Notes · Sunwood-ai-labs/PEGASUS

New features

Scraping can now be driven by a text file that lists URLs. Use the --url-file option to specify the file containing the URLs to scrape.
Scraped sites can now be classified with an LLM. Use the --system-message and --classification-prompt options to specify the LLM system message and the site-classification prompt.
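
As a rough illustration of the two features above, the sketch below reads a URL list and interprets a True/False reply from a model. It is not taken from the PEGASUS source: the function names `read_url_list` and `parse_classification` are hypothetical, and the file format (one URL per line, blank lines ignored) is an assumption.

```python
from pathlib import Path
from typing import List


def read_url_list(path: str) -> List[str]:
    """Return the non-empty lines of a text file, stripped of whitespace.

    Assumption: the file holds one URL per line.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def parse_classification(reply: str) -> bool:
    """Interpret an LLM reply that is expected to be 'True' or 'False'.

    Surrounding whitespace, quotes, and a trailing period are ignored.
    Any other reply raises ValueError so the caller can decide to retry.
    """
    normalized = reply.strip().strip("'\"「」.").lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    raise ValueError(f"unexpected classification reply: {reply!r}")
```

Raising on an unexpected reply, rather than guessing, keeps ambiguous answers from silently passing the filter.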

Improvements

The maximum recursion depth can now be limited with the --max-depth option. The default is unlimited.
The output file extension can now be set with the --output-extension option. The default is .md.
The file-size threshold for moving files to the dust folder can now be set, in bytes, with the --dust-size option. The default is 1000 bytes.
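
The depth limit and the dust threshold can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the PEGASUS implementation: `crawl_order`, `move_dust`, and the `dust` folder name are hypothetical, the start page is assumed to be depth 0, and files are assumed to be moved when they are smaller than the threshold.

```python
import shutil
from collections import deque
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def crawl_order(start: str,
                get_links: Callable[[str], Iterable[str]],
                max_depth: Optional[int] = None) -> List[str]:
    """Breadth-first traversal that stops following links at max_depth.

    max_depth=None means no limit. Assumption: the start page is depth 0.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # pages at the depth limit are visited but not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


def move_dust(output_dir: str, dust_size: int = 1000,
              dust_dirname: str = "dust") -> List[str]:
    """Move files smaller than dust_size bytes into a dust subfolder.

    Assumption: "smaller than" is the comparison; the folder name is made up.
    """
    out = Path(output_dir)
    dust = out / dust_dirname
    moved = []
    for path in sorted(out.iterdir()):
        if path.is_file() and path.stat().st_size < dust_size:
            dust.mkdir(exist_ok=True)
            shutil.move(str(path), str(dust / path.name))
            moved.append(path.name)
    return moved
```

With `max_depth=1`, only the start page and the pages it links to directly are visited, which mirrors the `--max-depth 1` examples in the usage section.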

Added command-line arguments

--base-url: The base URL at which scraping starts.
--url-file: A text file listing the URLs to scrape.
--output-extension: The output file extension. Default: .md.
--dust-size: The file-size threshold, in bytes, for moving files to the dust folder. Default: 1000 bytes.
--max-depth: The maximum recursion depth. Default: unlimited.
--system-message: The LLM system message, used for site classification.
--classification-prompt: The LLM site-classification prompt. Write it so that the model answers True or False.
--max-retries: The maximum number of retries for filtering. Default: 3.
--model: The LLM model name. Default: gemini/gemini-1.5-pro-latest.
--rate-limit-sleep: The sleep time, in seconds, after a rate-limit error. Default: 60 seconds.
--other-error-sleep: The sleep time, in seconds, after any other error. Default: 10 seconds.
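
For readers who want to see the options and defaults side by side, the list above could be declared with `argparse` roughly as follows. This is a sketch for illustration only: the flag names and defaults come from the list, while the types, the `build_parser` name, and the `pegasus-sketch` program name are assumptions, and the real CLI also accepts arguments not shown here (for example the output directory used in the usage examples).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in these release notes (types are assumed)."""
    parser = argparse.ArgumentParser(prog="pegasus-sketch")
    parser.add_argument("--base-url", default=None,
                        help="base URL at which scraping starts")
    parser.add_argument("--url-file", default=None,
                        help="text file listing the URLs to scrape")
    parser.add_argument("--output-extension", default=".md",
                        help="output file extension")
    parser.add_argument("--dust-size", type=int, default=1000,
                        help="dust-folder size threshold in bytes")
    parser.add_argument("--max-depth", type=int, default=None,
                        help="maximum recursion depth (None means unlimited)")
    parser.add_argument("--system-message", default=None,
                        help="LLM system message for site classification")
    parser.add_argument("--classification-prompt", default=None,
                        help="prompt that should yield True or False")
    parser.add_argument("--max-retries", type=int, default=3,
                        help="maximum number of retries for filtering")
    parser.add_argument("--model", default="gemini/gemini-1.5-pro-latest",
                        help="LLM model name")
    parser.add_argument("--rate-limit-sleep", type=int, default=60,
                        help="seconds to sleep after a rate-limit error")
    parser.add_argument("--other-error-sleep", type=int, default=10,
                        help="seconds to sleep after any other error")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```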

Bug fixes

Added retry handling to site classification, so that a classification request is retried when an error occurs.
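
One way such retry handling could look, combined with the --max-retries, --rate-limit-sleep, and --other-error-sleep options, is sketched below. This is not the PEGASUS code: `RateLimitError` is a hypothetical stand-in for whatever exception the LLM client raises, `classify_with_retry` is a made-up name, and treating max_retries as the total number of attempts is an assumption.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def classify_with_retry(classify: Callable[[], bool],
                        max_retries: int = 3,
                        rate_limit_sleep: float = 60,
                        other_error_sleep: float = 10,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call classify() until it succeeds or max_retries attempts have failed.

    Assumption: max_retries counts total attempts, not extra attempts.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return classify()
        except RateLimitError as exc:
            last_error = exc
            delay = rate_limit_sleep  # back off longer when rate limited
        except Exception as exc:
            last_error = exc
            delay = other_error_sleep
        if attempt < max_retries:
            sleep(delay)  # no sleep after the final failed attempt
    raise RuntimeError(
        f"classification failed after {max_retries} attempts"
    ) from last_error
```

The `sleep` parameter is injectable so the waiting behaviour can be checked in tests without actually pausing.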

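Other changes

Reorganized the code structure to improve readability.
Improved log output to show more detailed information.
Updated the README with descriptions of the new features and options.
Bumped the PyPI version from 0.1.1 to 0.2.0.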

Installation

Install pegasus with pip.

pip install pegasus-surf

Usage

From the command line

To use pegasus from the command line, run commands like the following.

pegasus --base-url https://example.com/start-page output_directory --exclude-selectors header footer nav --include-domain example.com --exclude-keywords login --output-extension .txt

pegasus --url-file urls.txt output/roomba --exclude-selectors header footer nav aside .sidebar .header .footer .navigation .breadcrumbs --exclude-keywords login --output-extension .txt --max-depth 1

pegasus --url-file urls.txt output/roomba2 --exclude-selectors header footer nav aside .sidebar .header .footer .navigation .breadcrumbs --exclude-keywords login --output-extension .txt --max-depth 1 --system-message "You are an assistant to determine if the content of a given website contains useful information related to a specific topic. If it contains relevant and beneficial information about the topic, answer 'True', otherwise answer 'False'." --classification-prompt "Does the content of the following website provide beneficial information about the Roomba API or iRobot? If so, answer 'True', if not, answer 'False'."

From a Python script

To use pegasus from a Python script, write code like the following.

from pegasus import Pegasus

pegasus = Pegasus(
    output_dir="output_directory", 
    exclude_selectors=['header', 'footer', 'nav'],
    include_domain="example.com",
    exclude_keywords=["login"],
    output_extension=".txt",
    dust_size=500,
    max_depth=2,
    system_message="You are an assistant to determine if the content of a given website contains useful information related to a specific topic. If it contains relevant and beneficial information about the topic, answer 'True', otherwise answer 'False'.",
    classification_prompt="Does the content of the following website provide beneficial information about the Roomba API or iRobot? If so, answer 'True', if not, answer 'False'.",
    max_retries=5,
    model="gemini/gemini-1.5-pro-latest",
    rate_limit_sleep=30,
    other_error_sleep=5
)
pegasus.run("https://example.com/start-page")

Enjoy PEGASUS v0.2.0! If you have feedback or feature requests, please let us know through an Issue or a Pull Request.
