# PhytoChat: A Multi-Turn RL-Based LLM for Diagnosis and Treatment of Plant Diseases

### AI 322 Mini Project

Ma. Madecheen S. Pangaliman \
Jessan Rendell G. Belenzo

- - -

### Preliminaries

Clone the GitHub repository at https://github.com/SuperMadee/PhytoChat, and place this Jupyter Notebook in the `PhytoChat` directory. Model checkpoints may be downloaded from https://drive.google.com/drive/folders/1mcexnpnd-XcokrALc6BY68191b-Tri4y?usp=drive_link.

### Install Dependencies

In [None]:
!pip install -r requirements.txt

In [1]:
# Set CUDA Device(s)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2'
# Supply your own Hugging Face access token; never commit a real token to the notebook
os.environ['HF_TOKEN'] = '<your Hugging Face token>'

### Crawling Raw Data

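Download each URL listed in `data/crawled/url_list.txt`, extract the main text of the page with `trafilatura`, and save the results to `data/crawled/webpages.json`.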

In [1]:
import trafilatura
import json
from tqdm import tqdm


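with open('data/crawled/url_list.txt', 'r') as f:
    # Skip blank lines (e.g. a trailing newline at the end of the file)
    urls = [line.strip() for line in f if line.strip()]

data = []
for url in tqdm(urls):
    try:
        downloaded = trafilatura.fetch_url(url)  # returns None if the download fails
        if downloaded is None:
            print(f'Failed to download {url}')
            continue
        text = trafilatura.extract(downloaded)
        data.append({
            'title': url,
            'url': url,
            'html': text  # extracted plain text, despite the key name
        })
    except Exception:
        print(f'Failed to download {url}')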

with open('data/crawled/webpages.json', 'w') as f:
    json.dump(data, f, indent=4)

100%|██████████| 19/19 [00:33<00:00,  1.78s/it]


Read and parse content from the PDF files.

In [11]:
import glob
import pypdfium2 as pdfium
import json
import os

pdfs_path = 'data/pdfs'
paths = glob.glob(f'{pdfs_path}/*.pdf')


for path in paths:
    filename = os.path.basename(path)
    name = os.path.splitext(filename)[0]
    json_filename = f'{name}.json'

    # Collect the text of each page as one record
    data = []
    pdf = pdfium.PdfDocument(path)
    for i, page in enumerate(pdf):
        # Load a text page helper
        textpage = page.get_textpage()
        # Extract text from the whole page
        text_all = textpage.get_text_range()
        data.append({
            'title': f"{name} - {i:04d}",
            'url': filename,
            'html': text_all
        })

    # Write the records once all pages have been read
    with open(f'data/crawled/{json_filename}', 'w') as f:
        json.dump(data, f, indent=4)
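
Both cells above write records with the same three keys (`title`, `url`, `html`) to `data/crawled/`. Before generating datasets, it can help to check how many records contain usable text. The following is a minimal sketch and not part of the repository: `summarize_crawled` is a hypothetical helper, and it assumes that every `*.json` file in the directory holds a list of such records.

```python
import glob
import json
import os


def summarize_crawled(crawled_dir='data/crawled'):
    """Count records and empty extractions per JSON file (hypothetical helper)."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(crawled_dir, '*.json'))):
        with open(path, 'r') as f:
            records = json.load(f)
        # The 'html' value may be None or blank when no text could be extracted
        empty = sum(1 for r in records if not (r.get('html') or '').strip())
        summary[os.path.basename(path)] = {'records': len(records), 'empty': empty}
    return summary


if __name__ == '__main__':
    for name, stats in summarize_crawled().items():
        print(f"{name}: {stats['records']} records, {stats['empty']} empty")
```

A file with a high share of empty records may point to a scanned PDF or a page that blocked the crawler, and is worth inspecting by hand before it feeds into dataset generation.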



### Dataset Generation

SFT Stage

In [6]:
!python generate_sft_data.py

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-11 22:07:29 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-11 22:07:30 utils.py:660] Found nccl from library

DPO and ArCHer

In [9]:
!python generate_dpo_data.py
!python generate_conversations.py
!python generate_dpo_data_multi_turn.py
!python combine_split_dpo_data.py

### Training

In [None]:
!python finetune_sft_llama.py
!python finetune_sft_mistral.py

In [None]:
!python finetune_dpo_llama.py
!python finetune_dpo_mistral.py

In [None]:
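# Each `!` command runs in its own subshell, so chain `cd` with the command that needs that directory
!cd ArCHer/archer && python scripts/run.py --config-name archer_phytochat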

### Making Predictions

In [None]:
!python generate_predictions_sft.py
!python generate_predictions_dpo.py
!python generate_predictions_archer_sft.py
!python generate_predictions_archer_dpo.py

### Model Evaluation using METEOR and BLEU
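
For orientation, BLEU scores a prediction by its clipped n-gram overlap with a reference, scaled by a brevity penalty, so paraphrased answers can score low even when they are correct. The sketch below is a simplified, unsmoothed sentence-level BLEU in plain Python with whitespace tokenization. It is an illustration only and not the implementation in `evaluate_models.py` (the log below suggests that script relies on NLTK); `ngrams` and `simple_bleu` are hypothetical names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, prediction, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n grams."""
    ref, pred = reference.split(), prediction.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clip each predicted n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize predictions shorter than the reference
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because this version applies no smoothing, any prediction that shares no 4-gram with its reference scores exactly 0, which is one reason library implementations offer smoothing options for sentence-level scores.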

In [10]:
!python evaluate_models.py

[nltk_data] Downloading package wordnet to /home/jessan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jessan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jessan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
---
BLEU and METEOR Scores on SFT test data:
data/predictions/vanilla_mistral_predictions_sft.json
BLEU: 0.017343051048786837
METEOR: 0.22207974599469948

data/predictions/sft_mistral_predictions_sft.json
BLEU: 0.05736011425608496
METEOR: 0.231967200266474

data/predictions/dpo_mistral_predictions_sft.json
BLEU: 0.019063562070634525
METEOR: 0.2246878429356538

data/predictions/archer_mistral_predictions_sft.json
BLEU: 0.018706533694269167
METEOR: 0.2217517769103468

data/predictions/vanilla_llama_predictions_sft.json
BLEU: 0.032966435021055805
METEOR: 0.18702997127730894

data/predictions/sft_llama_predictions_sft.json
BLEU: 0.012