<a href="https://colab.research.google.com/github/LostMa-ERC/JonasScraper/blob/main/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use the Jonas Scraper in a terminal

The Jonas Scraper is designed to be used in the terminal through a "Command-line Interface" (CLI). The alternative, a "Graphical User Interface" (GUI), would probably be more fun to use but would also be much more complicated to install and share, so the Jonas Scraper is presented as a simple CLI tool.

## Step 1. Set up

The very first thing to know about using any Python tool is the concept of an environment.

Say you've installed Python by following a [tutorial](https://realpython.com/installing-python/) online, and now you want to install a Python package 📦. The best practice is to install that package in a "virtual environment," so that something you only need for one specific project doesn't contaminate your entire Python installation.


There are multiple ways to create and activate a virtual Python environment (see this [tutorial](https://realpython.com/python-virtual-environments-a-primer/)). But the good news right now is that we're already in one!

When working in a [Jupyter Notebook](https://docs.jupyter.org/en/latest/), such as this one, you're already working inside what's called a "runtime environment." Simply by opening this notebook, which runs on one of Google's machines somewhere, we've already completed the first step of setting up an environment.

> If you're not in a notebook, follow tutorials online to learn how to (a) install Python, (b) check which version you have, and (c) create and activate a virtual environment.

We can double-check that we're ready to use Python by asking the computer which Python executable it's using. In the case of a Jupyter notebook on Google Colab, it will tell us we're using this virtual machine's main and only Python installation at `/usr/local/bin/python`.

> Note: In Jupyter notebooks, we can run a terminal command by adding a `!` before it. This applies to every command in this notebook that starts with `!`. In a real terminal, leave the exclamation point out.

In [7]:
!which python

/usr/local/bin/python


We can also check which version of Python we're using. For the Jonas Scraper we need version 3.10 or greater.

In [8]:
!python --version

Python 3.11.13
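The same two checks can also be made from inside Python, which is handy when you aren't sure which interpreter a notebook or script is really running on. The sketch below is only an illustration: the helper name `describe_environment` is made up for this example, and the "in a virtual environment" test only detects environments created with tools like `venv` (on Colab it will report `False`, because the runtime uses the machine's main Python).

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in this tutorial


def describe_environment() -> dict:
    """Collect the same facts as `which python` and `python --version`."""
    return {
        # Path of the interpreter running this code
        "executable": sys.executable,
        # Version as a dotted string, e.g. "3.11.13"
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # Inside a venv, sys.prefix points at the environment, not the base install
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # True when the interpreter meets the minimum version
        "version_ok": sys.version_info[:2] >= MIN_VERSION,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```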


## Step 2. Installation

Now that we know we have the right Python environment set up, it's time to install the Jonas Scraper package 📦 in it.

We'll use (1) Python's native "pip" package manager and (2) GitHub, where the package is stored.

And that's it!

In [1]:
!pip install git+https://github.com/LostMa-ERC/JonasScraper.git

Collecting git+https://github.com/LostMa-ERC/JonasScraper.git
  Cloning https://github.com/LostMa-ERC/JonasScraper.git to /tmp/pip-req-build-pz_ugcgt
  Running command git clone --filter=blob:none --quiet https://github.com/LostMa-ERC/JonasScraper.git /tmp/pip-req-build-pz_ugcgt
  Resolved https://github.com/LostMa-ERC/JonasScraper.git to commit 3d202884fdc37252fca94b8b5e33ae2d8c70ce6c
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting casanova>=2.0.2 (from jonas-scraper==0.0.1)
  Downloading casanova-2.0.2-py3-none-any.whl.metadata (29 kB)
Collecting duckdb>=1.3.0 (from jonas-scraper==0.0.1)
  Downloading duckdb-1.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.0 kB)
Collecting ural>=1.5.0 (from jonas-scraper==0.0.1)
  Downloading ural-1.5.0-py3-none-any.whl.metadata (56 kB)

## Step 3. Use the scraper

Now we're ready to collect some data! The CLI tool is installed and we're ready to call one of its two commands.
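If you want to confirm that the installation really succeeded before going further, a small check like the following can help. It is a sketch, not part of the Jonas Scraper: the helper `installed_version` is a name invented here, `jonas-scraper` is the distribution name that appears in the pip log of the install step, and `jonas` is the command used throughout this tutorial.

```python
import shutil
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Both lines print None if the install did not work in this environment.
print("package version:", installed_version("jonas-scraper"))
print("command on PATH:", shutil.which("jonas"))
```

If either line prints `None`, the most likely cause is that the package was installed in a different environment from the one you're running in.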

### One URL

For demonstration purposes, let's call its `url` command, which takes the URL of a Jonas manuscript or work page, typed or pasted after `jonas url`.

The tool will first scrape the page you gave it (in this case, a manuscript page) and then, having found a number of witnesses on that page, scrape more Jonas pages (in this case, work pages) to collect other relevant information about the discovered witnesses.

All of the results (the given page and witnesses) are printed in the terminal and the witnesses are written to a CSV file.
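Since the command expects either a manuscript page or a work page, it can be useful to check a URL's shape before passing it along. The sketch below is an assumption-laden illustration rather than anything the tool does itself: `classify_jonas_url` is a hypothetical helper, and the two path patterns (`/manuscrit/<id>` and `/oeuvre/<id>`) are inferred only from the URLs that appear in this tutorial.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Assumption: page kinds inferred from the two URL shapes seen in this tutorial.
_PATH = re.compile(r"^/(manuscrit|oeuvre)/(\d+)/?$")
_KINDS = {"manuscrit": "manuscript", "oeuvre": "work"}


def classify_jonas_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (kind, id) for a Jonas manuscript or work URL, else None."""
    parsed = urlparse(url.strip())
    if parsed.hostname != "jonas.irht.cnrs.fr":
        return None  # not a Jonas page at all
    match = _PATH.match(parsed.path)
    if match is None:
        return None  # a Jonas page, but not one of the two known shapes
    return _KINDS[match.group(1)], match.group(2)


print(classify_jonas_url("http://jonas.irht.cnrs.fr/manuscrit/72035"))
print(classify_jonas_url("http://jonas.irht.cnrs.fr/oeuvre/27919"))
```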

In [2]:
!jonas url "http://jonas.irht.cnrs.fr/manuscrit/72035"

Scraping... ⠏
╭──────────────────────────────────────────────────────────────────────────────╮
│ Manuscript(                                                                  │
│     id='72035',                                                              │
│     exemplar='Paris, Bibliothèque nationale de France, Manuscrits, fr.       │
│ 00842',                                                                      │
│     date=None,                                                               │
│     language=None                                                            │
│ )                                                                            │
╰──────────────────────────────────────────────────────────────────────────────╯
'http://jonas.irht.cnrs.fr/oeuvre/27919' generated an exception: The read 
operation timed out
Fetching URLs...

#### Witnesses file


Because the witnesses can be numerous and printing many in the terminal can make them hard to read, they're also saved in a CSV file under the name `jonas_witnesses.csv`.

If you click on the folder icon 📁 on the left-hand side of this notebook, you'll see that the file was created.

Below, in Python, we open the file and print out the first row.

In [4]:
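import csv
from pprint import pprint

# newline="" is what the csv module recommends; utf-8 keeps accented characters intact
with open("jonas_witnesses.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    # Print only the first row, then stop reading
    for row in reader:
        pprint(row)
        break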

{'witness_date': '',
 'witness_doc_id': '72035',
 'witness_foliation': 'Folio 114r - 114r',
 'witness_id': 'temoin77336',
 'witness_siglum': '',
 'witness_status': '',
 'witness_work_id': '26561',
 'work_author': 'Mellin de Saint-Gelais',
 'work_date': 'Avant 1558, période des premières attestations manuscrites',
 'work_form': 'vers',
 'work_id': '26561',
 'work_incipit': "Asseuré suis d'estre prys et lyé",
 'work_keywords': '[]',
 'work_language': 'oil-français',
 'work_links': '[]',
 'work_meter': 'Décasyllabes',
 'work_n_verses': '14',
 'work_rhyme_scheme': 'ABBA ABBA CDC DDC',
 'work_scripta': '',
 'work_title': 'Sonnet'}
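Once the file exists, you'll usually want more than the first row. The sketch below reads every row and counts witnesses per value of a chosen column. The function names are invented for this example, and one assumption is worth flagging: `work_keywords` and `work_links` appear above as the string `'[]'`, so the code guesses that they hold JSON arrays and falls back to the raw string if that guess is wrong.

```python
import csv
import json
from collections import Counter
from typing import Iterable, List


def load_witnesses(lines: Iterable[str]) -> List[dict]:
    """Read witness rows and decode the columns that hold list-like strings."""
    rows = []
    for row in csv.DictReader(lines):
        for column in ("work_keywords", "work_links"):
            if column not in row:
                continue
            # Assumption: these columns hold JSON arrays such as '[]'.
            try:
                row[column] = json.loads(row[column] or "[]")
            except json.JSONDecodeError:
                pass  # keep the raw string if it is not valid JSON
        rows.append(row)
    return rows


def count_by(rows: List[dict], column: str) -> Counter:
    """Count rows per value of a column, skipping empty values."""
    return Counter(row[column] for row in rows if row.get(column))


# Tiny in-memory sample using values from the row printed above.
sample = [
    "witness_id,work_author,work_keywords\n",
    "temoin77336,Mellin de Saint-Gelais,[]\n",
]
print(count_by(load_witnesses(sample), "work_author"))
```

To run it on the real file, pass an open file handle instead of the sample list, for example `load_witnesses(open("jonas_witnesses.csv", newline="", encoding="utf-8"))`.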


### Multiple URLs

It's likely that you have a set of Jonas URLs whose metadata you want to collect. The Jonas Scraper's second command is called `scrape` and it takes a CSV file of URLs.

This command requires 3 pieces of information.

- `-i` / `--infile` : the path to the CSV file
- `-c` / `--column` : the name of the column in the file that has the Jonas URLs
- `-o` / `--outdir` : the path to a folder in which the Jonas Scraper will write a witnesses file and save a little database file.
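If your URLs currently live in a text file or a spreadsheet column, a few lines of Python can turn them into a suitable input file. This is a sketch under simple assumptions: `write_url_file` is a hypothetical helper, and the default column name `jonas_url` is just the one used in the repository's example file. Any column name should work as long as you pass the same name to `-c`.

```python
import csv
from pathlib import Path
from typing import Iterable


def write_url_file(urls: Iterable[str], path: Path, column: str = "jonas_url") -> int:
    """Write one URL per row under a single header; return the number of rows."""
    # Strip whitespace, drop blanks, and remove duplicates while keeping order.
    cleaned = (url.strip() for url in urls)
    unique = list(dict.fromkeys(url for url in cleaned if url))
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([column])  # header row
        writer.writerows([url] for url in unique)
    return len(unique)
```

For example, `write_url_file(my_urls, Path("my_urls.csv"))` would produce a file that can then be passed to the command with `-i my_urls.csv -c jonas_url` (`my_urls` and the file name are placeholders).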

The `scrape` command also creates a database file. If you need to stop the collection and restart it later for any reason, and you provide the same `outdir` path as before, the scraper will skip the web pages it has already scraped and saved in that database file.
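To make the idea of resuming concrete, here is a toy version of the pattern. It is emphatically not the Jonas Scraper's own code: the tool's database format isn't described here (the install log only shows that it depends on `duckdb`), so this sketch swaps in Python's built-in `sqlite3`, an invented `pages` table, and a `fetch` function supplied by the caller.

```python
import sqlite3
from typing import Callable, Iterable, List


def scrape_with_resume(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    db_path: str,
) -> List[str]:
    """Fetch only URLs not yet recorded in the database; return those fetched."""
    conn = sqlite3.connect(db_path)
    # Hypothetical table layout, for illustration only
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    fetched = []
    try:
        for url in urls:
            seen = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
            if seen:
                continue  # already saved by an earlier run
            try:
                body = fetch(url)
            except Exception as error:  # e.g. a read timeout
                print(f"{url!r} generated an exception: {error}")
                continue  # not recorded, so the next run retries it
            conn.execute("INSERT INTO pages (url, body) VALUES (?, ?)", (url, body))
            conn.commit()  # commit per page so an interruption loses nothing
            fetched.append(url)
    finally:
        conn.close()
    return fetched
```

Because only successful pages are written to the database, calling the function again with the same `db_path` naturally retries the failures and skips everything else.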

Because we're in a Google Colab notebook, which is a virtual machine somewhere on Google's servers, we'll need to bring our data into this machine's file system.

> Note: When working in a real terminal, providing a path to your CSV file is much easier.

The Jonas Scraper GitHub repository has an example CSV file. We'll download it here and use it for the demonstration.

In [9]:
! curl -o ./example.csv "https://raw.githubusercontent.com/LostMa-ERC/JonasScraper/refs/heads/main/example.csv"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   136  100   136    0     0    651      0 --:--:-- --:--:-- --:--:--   653


We saved the example CSV file to the path `./example.csv`, and its URLs are stored in the column named `jonas_url`. With that, we have everything we need to run the command.

In [10]:
!jonas scrape -i ./example.csv -c jonas_url -o output_example/

Fetching URLs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 0:00:02
'http://jonas.irht.cnrs.fr/oeuvre/3669' generated an exception: The read 
operation timed out
'http://jonas.irht.cnrs.fr/oeuvre/4100' generated an exception: The read 
operation timed out
Fetching discovered URLs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━  9/11 0:00:12
view a list of witnesses in this CSV file: 
'/content/output_example/example_witnesses.csv'


In this example, the requests for some of the URLs timed out before Jonas's server responded. There are many reasons this might happen, and it's another reason the Jonas Scraper creates a database file (`output_example/jonas.db`): when we run the same command again, it retries just the URLs that didn't work.
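When a run covers many pages, it can help to pull the failed URLs out of the terminal output so you know what to expect from the retry. The sketch below does that with a regular expression. It assumes the message wording stays exactly as in the run above (`'URL' generated an exception: ...`), which may change between versions of the tool, and `failed_urls` is a name made up for this example.

```python
import re
from typing import List

# Matches the message format seen in the run above:
# 'URL' generated an exception: ...
_FAILURE = re.compile(r"'(https?://[^']+)' generated an exception")


def failed_urls(log_text: str) -> List[str]:
    """Return the URLs reported as failed, in order, without duplicates."""
    return list(dict.fromkeys(_FAILURE.findall(log_text)))


log = """\
'http://jonas.irht.cnrs.fr/oeuvre/3669' generated an exception: The read
operation timed out
'http://jonas.irht.cnrs.fr/oeuvre/4100' generated an exception: The read
operation timed out
"""
print(failed_urls(log))
```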

In [11]:
!jonas scrape -i ./example.csv -c jonas_url -o output_example/

Fetching URLs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 0:00:01
Fetching discovered URLs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 0:00:01
view a list of witnesses in this CSV file: 
'/content/output_example/example_witnesses.csv'
