# IMPORTANT
As of [July 16, 2024](https://its.umich.edu/computing/ai/release-notes), Maizey supports mixed data sources, so this script is deprecated. It may still be useful when Maizey's default parser produces inaccurate information, because you can manually edit this parser's output. For the majority of pages, however, this will not be an issue.

# Info
Custom web parser with proposition extraction capability.
This parser is meant for a first pass; for best results, manually audit the output and delete anything the parser pulled by accident (e.g. navigation items). It may also miss content, in which case you will have to add it by hand.

The pipeline for parsing websites is as follows:
1. Download websites into ./unparsed
2. Run the parser
3. Run pandoc (optional)
4. Upload to Pascal's data sources

You can download websites using either the scraper or the save command in most web browsers (Cmd + S on macOS, Ctrl + S on Windows and Linux). Make sure to delete non-HTML files from ./unparsed before running the parser.
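Before running the parser, it can help to list whatever in ./unparsed is not an HTML file. The following is a minimal sketch, assuming the parser only reads files whose names end in `.html` (as the loop in this notebook does). `find_non_html` is a hypothetical helper, not part of the parser; it only reports names and deletes nothing.

```python
import os

def find_non_html(directory="./unparsed"):
    """Return the sorted names of entries in `directory` that are not .html files.

    Hypothetical helper. Subdirectories (for example, the resource folders
    some browsers create next to a saved page) are reported as well.
    """
    leftovers = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Match the parser's own check, which is case-sensitive.
        if os.path.isdir(path) or not name.endswith(".html"):
            leftovers.append(name)
    return sorted(leftovers)
```

Calling `find_non_html()` from the directory that contains the notebook returns the names to review and remove by hand.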

# Quickstart
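To run this parser, you must have [Jupyter](https://jupyter.org/) installed. We recommend installing Jupyter Notebook. The notebook also imports `openai`, `dotenv` (from the python-dotenv package), and `bs4` (from the beautifulsoup4 package), and it uses the `lxml` parser, so these must be installed as well. Then, create a directory containing this parser and two empty folders titled 'parsed' and 'unparsed'. Download any websites you want parsed into 'unparsed', and run the parser via Jupyter Notebook.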

# Troubleshooting
If you get an error while running this script, it is likely that there was an unresolved case (for example, the indicator for table rows was not included in the parser). You can either remove the offending website or modify the parser directly to resolve this.
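One way to find the offending website is to parse each file separately and record which ones raise an error. The following is a minimal sketch, assuming the per-file logic has been moved into its own function. `parse_one` and `run_all` are hypothetical names and are not part of the parser as written.

```python
def run_all(filenames, parse_one):
    """Call parse_one(filename) for each file, collecting failures instead of stopping.

    Returns (succeeded, failed), where `failed` maps a filename to a short
    description of the error it raised.
    """
    succeeded, failed = [], {}
    for filename in filenames:
        try:
            parse_one(filename)
            succeeded.append(filename)
        except Exception as exc:  # broad on purpose: any unhandled case counts
            failed[filename] = f"{type(exc).__name__}: {exc}"
    return succeeded, failed
```

The files listed in `failed` are the candidates to remove from ./unparsed or to handle by modifying the parser.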

# Using an LLM
This parser allows you to optionally use an LLM to perform proposition extraction.
The idea is that we can pull the most relevant information from a website using an LLM.
This produces superior results when these documents are used by Maizey.
However, it requires a GPT Toolkit API key and is not strictly necessary.
We opted for this technique in the inaugural batch of documents as we already had a token and wanted to use it as a proof-of-concept. It is not recommended for every use case.
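The prompt used in this notebook asks the model to label its output with "Propositions:" and "Summary:" and to separate propositions with blank lines, so a response can be split apart for manual review. The following is a minimal sketch, assuming the model follows those instructions and lists the propositions first (it may not). `split_response` is a hypothetical helper, not part of the parser.

```python
def split_response(response):
    """Split an LLM response into (propositions, summary).

    Assumes the layout requested by the prompt: a "Propositions:" label,
    propositions separated by blank lines, then a "Summary:" label.
    If "Summary:" is missing, the summary is returned as an empty string.
    """
    head, _, summary = response.partition("Summary:")
    head = head.replace("Propositions:", "", 1)
    propositions = [p.strip() for p in head.split("\n\n") if p.strip()]
    return propositions, summary.strip()
```

Reviewing the propositions one at a time makes it easier to spot statements the model invented or navigation text it repeated.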

## Pros
- Measurably superior retrieval

## Cons
- Requires an API key
- More expensive
- Harder to work with

If you decide you want to use an LLM, this script requires a .env with the following info:

model=**any model**

embedder=**text-embedding-3-small**

azure_endpoint=https://api.umgpt.umich.edu/azure-openai-api

OPENAI_API_KEY=[api key]

OPENAI_organization=**shortcode**

API_VERSION=2023-05-15

Replace the bolded text and the bracketed `[api key]` placeholder with the corresponding information. Then, set `use_llm` to **True** in the settings cell and modify the prompt as you see fit.
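If a setting is missing, the notebook fails with a `KeyError` when it reads `os.environ`. The following is a minimal sketch for checking the settings up front, assuming the key names listed above. `missing_settings` is a hypothetical helper; it accepts any mapping, such as `os.environ` after `load_dotenv` has run.

```python
import os

# Settings the notebook reads via os.environ. "embedder" appears in the .env
# listing but is not read by the code in this notebook, so it is not checked.
REQUIRED_SETTINGS = (
    "model",
    "azure_endpoint",
    "OPENAI_API_KEY",
    "OPENAI_organization",
    "API_VERSION",
)

def missing_settings(env=None):
    """Return the required settings that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_SETTINGS if not env.get(key)]
```

An empty list means every setting the notebook reads is present; it does not mean the values themselves are valid.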

# Convert to .docx

The parser returns .md files by default. While Maizey is able to use .md files, you cannot edit them in Google Docs, and Google Drive's viewer does not render their formatting. We therefore recommend using a tool like [pandoc](https://pandoc.org/) to convert these .md files into .docx, a format which Google Docs supports. Refer to [pandoc's official documentation](https://pandoc.org/installing.html) for installation instructions. The shell one-liner below runs pandoc on all .md files in your current directory.

`for f in *.md; do pandoc "$f" -s -o "${f%.md}.docx"; done`
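The same loop can be written in Python, which may be convenient on systems without a POSIX shell. The following is a minimal sketch, assuming pandoc is installed and on the PATH. `pandoc_command` and `convert_all` are hypothetical helpers that pass the same `-s` and `-o` options as the one-liner above.

```python
import glob
import os
import subprocess

def pandoc_command(md_path):
    """Build the pandoc argument list that converts one .md file to .docx."""
    docx_path = os.path.splitext(md_path)[0] + ".docx"
    return ["pandoc", md_path, "-s", "-o", docx_path]

def convert_all(directory="."):
    """Run pandoc on every .md file in `directory` (requires pandoc on PATH)."""
    for md_path in sorted(glob.glob(os.path.join(directory, "*.md"))):
        # check=True raises CalledProcessError if pandoc reports a failure.
        subprocess.run(pandoc_command(md_path), check=True)
```

Calling `convert_all("./parsed")` would convert the parser's output in place, leaving the original .md files untouched.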

In [None]:
from openai import AzureOpenAI
import os
from dotenv import load_dotenv
from bs4 import BeautifulSoup
import codecs

### SETTINGS ###
use_llm = True

prompt = """
You are a helpful assistant. When given a document, extract every proposition into a list and summarize the document afterwards. Keep your answers as short as possible without losing any meaning.
If you are given links, make sure to include them in your list, but not the summary.
Preface your response with "Propositions:" or "Summary:".
Do not include any other information in your response.
Do not comment on the quality of the document.
Do not include any information that is not a proposition or a summary.
Put \\n\\n between each proposition.
"""

if use_llm:
    notebook_dir = os.path.dirname(os.path.abspath('parse.ipynb'))
    os.chdir(notebook_dir)
    
    # load_dotenv returns False when no .env file (or no variables) was found.
    if not load_dotenv('.env'):
        raise SystemExit('Unable to load .env file.')

    client = AzureOpenAI(
        api_key=os.environ['OPENAI_API_KEY'],  
        api_version=os.environ['API_VERSION'],
        azure_endpoint = os.environ['azure_endpoint'],
        organization = os.environ['OPENAI_organization']
    )

    def query(sysprompt = '', doc = ''):
        response = client.chat.completions.create(
            model=os.environ['model'],
            messages=[
                {"role": "system", "content": sysprompt},
                {"role": "user", "content": doc}
            ],
            temperature=0,
            stop=None)
        return response.choices[0].message.content
    
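    def extractPropositions(doc):
        # Skip the leading "Link to original" line, if the page had one,
        # so that only page content is sent to the model.
        lines = doc[1:] if doc and doc[0].startswith('Link to original:') else doc
        return query(prompt, ''.join(lines))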

In [None]:
directory = './unparsed'

def convert_to_markdown(element):
    """Return the element's text followed by the URLs of any links inside it."""
    if element.name == 'a':
        # A bare link: emit "text(url)" once, without repeating the text.
        url = element.get('href')
        return f"{element.text}({url})" if url else element.text
    links = ""
    for child in element.descendants:
        if child.name == 'a':
            url = child.get('href')
            if url:
                links += f"({url})"
    return element.text + links

for filename in os.listdir(directory):
    if filename.endswith('.html'):
        filepath = os.path.join(directory, filename)

        with codecs.open(filepath, 'r', 'utf-8') as file:
            content = file.read()
            soup = BeautifulSoup(content, 'lxml')

        print(soup.title.string if soup.title else 'No title found')

        markdown_lines = []
        canonical_link = soup.find("link", {"rel": "canonical"})
        if canonical_link:
            link_text = f"Link to original: [{canonical_link['href']}]({canonical_link['href']})\n\n"
            markdown_lines.append(link_text)
        else:
            og_link = soup.find("meta", {"property": "og:url"})
            if og_link:
                link_text = f"Link to original: [{og_link['content']}]({og_link['content']})\n\n"
                markdown_lines.append(link_text)

        body = soup.find("body")

        for nav in body.find_all('nav'):
            nav.decompose()

        elements = body.find_all(['table', 'p', 'h1', 'li'])
        for element in elements:
            if element.name == 'table':
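                # Some tables have no <tbody>; fall back to the table itself.
                container = element.find('tbody')
                if container is None:
                    container = element
                rows = container.find_all('tr', recursive=False)
                for row in rows:
                    cells = row.find_all('td')
                    for cell in cells:
                        # Flatten multi-line cell text into a single line.
                        cell_body = "; ".join(cell.text.strip().split('\n'))
                        # Prefix with the column header when the cell carries one.
                        header = cell.get('data-th')
                        if header is not None:
                            cell_text = f'{header}: {cell_body}\n\n'
                        else:
                            cell_text = f'{cell_body}\n\n'
                        markdown_lines.append(cell_text)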
            elif element.name in ['p', 'li']:
                if element.text.strip() == "":
                    continue
                parentnames = [parent.name for parent in element.parents]
                if 'li' not in parentnames:
                    formatted_line = convert_to_markdown(element).strip()
                    markdown_lines.append("- " + formatted_line + "\n\n")
            elif element.name == 'h1':
                header_text = f"# {element.text}\n\n"
                markdown_lines.append(header_text)
        
        if use_llm:
            propositions = extractPropositions(markdown_lines)
            markdown_lines.append('\n\n')
            markdown_lines.append(propositions)

        output_path = os.path.join('./parsed/', filename[:-5] + '.md')
        with open(output_path, 'w', encoding='utf-8') as md_file:
            md_file.write(''.join(markdown_lines))