# Data scraping (w/o Docker)

The goal of this notebook is to present the basic steps for scraping an opam package using the rocq-ml-toolbox.

## Installation

First, let's start by installing all the requirements.
In this notebook, the rocq-ml-server is **NOT** run inside Docker; only the Redis server below is started from a Docker image.

## Starting servers

Now, we need to start both the Redis server and the rocq-ml-server.

The Redis server:

In [None]:
!docker run -d -p 6379:6379 redis:latest

The rocq-ml-server:

In [None]:
!rocq-ml-server -d --num-pet-server 8 --workers 17
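
Before moving on, it can help to confirm that both servers accept connections. The sketch below is not part of rocq-ml-toolbox: `port_is_open` is a hypothetical helper built only on the standard library. It assumes Redis listens on port 6379 (from the `docker run` command above) and that the rocq-ml-server listens on port 5000, the port the client connects to later in this notebook.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports are assumptions: 6379 for Redis, 5000 for the rocq-ml-server.
    for name, port in [("redis", 6379), ("rocq-ml-server", 5000)]:
        status = "up" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} (port {port}): {status}")
```

An open port only shows that something is listening; the client connection in the next section remains the real check.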

## First example: scraping a file from stdlib

Let's create and connect a client to the rocq-ml-server.

In [None]:
import time
from rocq_ml_toolbox.inference.client import PetClient

client = PetClient('127.0.0.1', 5000)

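# Retry for about 10 seconds while the server starts.
for _ in range(10):
    try:
        client.connect()
        print("Connection OK.")
        break
    except Exception:
        print("Waiting for server...")
        time.sleep(1)
else:
    # The loop finished without a successful connection.
    raise RuntimeError("Could not connect to the rocq-ml-server.")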

Now let's create a `RocqParser` on top of this client.

In [None]:
from rocq_ml_toolbox.parser.rocq_parser import RocqParser, Source, Theorem

parser = RocqParser(client)

Let's load a target source file and scrape its table of contents (TOC):

In [None]:
# Adapt filepath to your opam directory, opam env etc.
filepath = "/home/theo/.opam/mc_dev/lib/coq/user-contrib/Stdlib/Structures/OrdersFacts.v"
source = Source.from_local_path(filepath)

for entry in parser.extract_toc(source):
    kind = entry.kind
    print(f"KIND: {kind}\n{entry.data['content']}")

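Next, let's extract the proofs from the same file, printing each tactic along with its dependencies:

In [None]: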
# Adapt filepath to your opam directory, opam env etc.
filepath = "/home/theo/.opam/mc_dev/lib/coq/user-contrib/Stdlib/Structures/OrdersFacts.v"
source = Source.from_local_path(filepath)

for entry in parser.extract_proofs(source):
    print(entry.element.name)
    print("PROOF")
    for step in entry.steps:
        print(f"tactic: {step.step}")
        dependencies = [f"{dep.name} in {dep.data['fqn']}" for dep in step.dependencies]
        if dependencies:
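            # Join outside the f-string: nested quotes and backslashes inside
            # f-string expressions require Python 3.12+.
            deps = "\n- ".join(dependencies)
            print(f"dependencies:\n- {deps}")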
    print()
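
To keep the extracted proofs for later use, one option is to flatten them into plain dictionaries and write one JSON object per line. This is only a sketch: `proof_to_dict` and `write_jsonl` are hypothetical helpers, not part of rocq-ml-toolbox. They rely only on the attributes already used in the loop above (`element.name`, `steps`, `step.step`, `step.dependencies`, `dep.name`, `dep.data['fqn']`) and assume those values are JSON-serializable.

```python
import json
from typing import Any, Dict, Iterable


def proof_to_dict(entry: Any) -> Dict[str, Any]:
    """Flatten one extracted proof into JSON-serializable data."""
    return {
        "name": entry.element.name,
        "steps": [
            {
                "tactic": step.step,
                "dependencies": [
                    {"name": dep.name, "fqn": dep.data["fqn"]}
                    for dep in step.dependencies
                ],
            }
            for step in entry.steps
        ],
    }


def write_jsonl(entries: Iterable[Any], path: str) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(proof_to_dict(entry)) + "\n")
            count += 1
    return count
```

With the objects defined above, this could be called as `write_jsonl(parser.extract_proofs(source), "orders_facts.jsonl")`, where the output file name is arbitrary.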

## Second example: scraping the entire stdlib

Instead of processing one file at a time, let's process every file in the stdlib in parallel (beware: this may cause high memory usage).

In [None]:
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager
from queue import Empty
import os
from typing import List, Tuple

from tqdm.notebook import tqdm
from rocq_ml_toolbox.inference.client import PetClient
from rocq_ml_toolbox.parser.rocq_parser import RocqParser, Source, Theorem


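def extract_source(filepath: str, host: str, port: int, progress_queue) -> List[Theorem]:
    """Extract all proofs from one file, reporting progress through the queue."""
    # Each worker process opens its own connection to the server.
    client = PetClient(host, port)
    client.connect()
    parser = RocqParser(client)
    source = Source.from_local_path(filepath)

    extracted: List[Theorem] = []
    for theorem in parser.extract_proofs(source, timeout=120):
        extracted.append(theorem)
        progress_queue.put(1)  # one tick per extracted element
    return extracted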


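def run_parallel(jobs, host, port, max_workers=8):
    with Manager() as manager:
        q = manager.Queue()

        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(extract_source, fp, host, port, q) for fp in jobs]

            elem_pbar = tqdm(desc="Elements extracted")
            # Poll the queue until every worker has finished and the queue is drained.
            while True:
                # Check completion *before* reading, so an empty queue afterwards
                # really means there is nothing left to count.
                all_done = all(f.done() for f in futures)
                try:
                    msg = q.get(timeout=0.1)
                except Empty:
                    if all_done:
                        break
                    continue
                elem_pbar.update(int(msg))
            elem_pbar.close()

            # Collect the results while the manager is still alive.
            return [f.result() for f in futures]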

# Adapt stdlib_dir to your opam directory, opam env etc.
stdlib_dir = "/home/theo/.opam/mc_dev/lib/coq/user-contrib/Stdlib/"
host = "127.0.0.1"
port = 5000

# Collect every .v file under the stdlib directory.
jobs: List[str] = []
for root, _, filenames in os.walk(stdlib_dir):
    for name in filenames:
        if name.endswith('.v'):
            jobs.append(os.path.join(root, name))

results = run_parallel(jobs, host, port)
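
Note that `f.result()` re-raises any exception raised in a worker, so a single failing file aborts the whole collection. A possible variant, sketched below, records failures per file instead. `collect_results` is a hypothetical helper, not part of rocq-ml-toolbox; it expects a dict mapping each submitted future to the file path it was submitted for.

```python
from concurrent.futures import Future, as_completed
from typing import Any, Dict, Tuple


def collect_results(
    future_to_path: Dict[Future, str],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Split finished futures into per-file results and per-file error messages."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for future in as_completed(future_to_path):
        path = future_to_path[future]
        try:
            results[path] = future.result()
        except Exception as exc:
            # Record the failure and keep going with the other files.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

Inside `run_parallel`, the futures could be built as `{executor.submit(extract_source, fp, host, port, q): fp for fp in jobs}` and passed to `collect_results` in place of the final list comprehension.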
