# KIARAPEANA - exploring code generation for KIARA using the EUROPEANA Python-API as a proof-of-concept

Acknowledgments:

*Mariella De Crouy Chanel (General Ideation and Design of prompt-based module builder)*

*Markus Binsteiner (Technical Support)*

Notebook Author: *Cosimo Palma*  
cosimo.palma@phd.unipi.it

This notebook gathers some outputs of a GPT-4o-based [Kiara Module Builder](https://chatgpt.com/g/g-Z2RwpuJbw-kiara-module-builder). Its knowledge base was built from the kiara code and documentation, freely available at https://github.com/DHARPA-Project and https://dharpa.org/kiara.documentation/latest/ .
As a proof of concept, we chose the creation of a kiara plugin that integrates the Europeana API. For the moment, we only consider tasks within the application scenario of *Topic Modelling*. Its integration into a plugin is deferred to a later stage of development.

This example closely follows the [tutorial for DHBenelux 2023](https://github.com/DHARPA-Project/kiara_plugin.dh_tagung_2023/blob/main/docs/notebooks/Hello_kiara.ipynb). However, instead of Italian journals, the dataset for the proposed workflow consists of journals harvested through the Europeana API (pyeuropeana).

First of all, let us install the necessary packages: kiara and the latest versions of its core plugins.

In [1]:
!pip install kiara kiara-plugin.core-types kiara-plugin.html kiara-plugin.jupyter kiara-plugin.language-processing kiara-plugin.network-analysis kiara-plugin.onboarding kiara-plugin.streamlit kiara-plugin.tabular

Collecting kiara
  Downloading kiara-0.5.12-py3-none-any.whl.metadata (9.6 kB)
Collecting kiara-plugin.core-types
  Downloading kiara_plugin.core_types-0.5.1-py3-none-any.whl.metadata (5.1 kB)
Collecting kiara-plugin.html
  Downloading kiara_plugin.html-0.5.0-py3-none-any.whl.metadata (6.9 kB)
Collecting kiara-plugin.jupyter
  Downloading kiara_plugin.jupyter-0.5.0-py3-none-any.whl.metadata (6.7 kB)
Collecting kiara-plugin.language-processing
  Downloading kiara_plugin.language_processing-0.5.0-py3-none-any.whl.metadata (6.6 kB)
Collecting kiara-plugin.network-analysis
  Downloading kiara_plugin.network_analysis-0.5.1-py3-none-any.whl.metadata (6.5 kB)
Collecting kiara-plugin.onboarding
  Downloading kiara_plugin.onboarding-0.5.1-py3-none-any.whl.metadata (5.2 kB)
Collecting kiara-plugin.streamlit
  Downloading kiara_plugin.streamlit-0.5.1-py3-none-any.whl.metadata (7.1 kB)
Collecting kiara-plugin.tabular
  Downloading kiara_plugin.tabular-0.5.5-py3-none-any.whl.metadata (5.3 kB)
Colle

At this point, we need to upload our journals as a data bundle, an operation that the original tutorial performs as follows:

inputs = {
    "url": "https://github.com/DHARPA-Project/kiara.examples/archive/refs/heads/main.zip",
    "sub_path": "kiara.examples-main/examples/workshops/dh_benelux_2023/data"
 }

dl_bundle = kiara.run_job('download.file_bundle', inputs=inputs)
dl_bundle


Our goal is to create a kiara module that downloads journals directly from Europeana, before proceeding with data tabularization and analysis.

***A kiara module for downloading journals by means of the Europeana API***

First of all let us install the pyeuropeana library.
The related documentation can be found [here](https://rd-europeana-python-api.readthedocs.io/en/stable/index.html).

In [2]:
!pip install pyeuropeana

Collecting pyeuropeana
  Downloading pyeuropeana-0.1.7-py3-none-any.whl.metadata (4.6 kB)
Collecting fire<0.5,>=0.4 (from pyeuropeana)
  Downloading fire-0.4.0.tar.gz (87 kB)
     87.7/87.7 kB 2.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting pandas<2.0,>=1.3 (from pyeuropeana)
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading pyeuropeana-0.1.7-py3-none-any.whl (17 kB)
Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
   12.1/12.1 MB 58.7 MB/s eta 0:00:00
Building wheels for collected packages: fire
  Building wheel for fire (setup.py) ... done
  Created wheel for fire: filename=fire-0.4.0-py2.py3-none-any.whl size=115926 sha256=8a5d24d4c4ad290649cebaecccca544df

There are some dependency-resolution issues that are not shown in the Colab output. The main one concerns pandas: as the log above shows, pyeuropeana requires `pandas<2.0` (pip installs 1.5.3), whereas the rest of the environment expects version 2.0.3. In Colab this can only be solved manually (and temporarily) by uninstalling and re-installing pandas every time, according to the library being used. Outside Colab, an isolated virtual environment (for example one created with mamba or `venv`) avoids the conflict.
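Before switching libraries, it can help to check whether a given pandas version falls inside the range a package asks for. The following is a minimal sketch using only the standard library; the helper names `parse_version`, `in_range` and `installed_version` are illustrative, only plain numeric versions are handled, and the bounds are the ones pip printed above for pyeuropeana (`pandas<2.0,>=1.3`).

```python
import itertools
from importlib import metadata


def parse_version(text):
    """Turn a dotted version such as '1.5.3' into a tuple of ints (1, 5, 3)."""
    parts = []
    for piece in text.split("."):
        # Keep only the leading digits of each segment (illustrative simplification).
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version, minimum, below):
    """Return True if minimum <= version < below."""
    return parse_version(minimum) <= parse_version(version) < parse_version(below)


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("installed pandas:", installed_version("pandas"))
print("1.5.3 fits pandas<2.0,>=1.3:", in_range("1.5.3", "1.3", "2.0"))
print("2.0.3 fits pandas<2.0,>=1.3:", in_range("2.0.3", "1.3", "2.0"))
```

For anything beyond simple numeric versions, a dedicated version-parsing library would be the safer choice.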

The following command gives a quick overview of the installed version of every package.

In [3]:
!pip list --format=freeze

absl-py==1.4.0
accelerate==0.34.2
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
airium==0.2.6
alabaster==0.7.16
albucore==0.0.19
albumentations==1.4.20
altair==4.2.2
annotated-types==0.7.0
anyio==3.7.1
anywidget==0.9.13
appdirs==1.4.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array_record==0.5.1
arrow==1.3.0
arviz==0.20.0
astor==0.8.1
astropy==6.1.4
astropy-iers-data==0.2024.10.28.0.34.7
astunparse==1.6.3
async-timeout==4.0.3
atpublic==4.1.0
attrs==24.2.0
audioread==3.0.1
autograd==1.7.0
babel==2.16.0
backcall==0.2.0
backoff==2.2.1
bases==0.3.0
beautifulsoup4==4.12.3
bibtexparser==1.4.2
bidict==0.23.1
bigframes==1.25.0
bigquery-magics==0.4.0
black==24.10.0
bleach==6.2.0
blinker==1.4
blis==0.7.11
blosc2==2.0.0
bokeh==3.4.3
boltons==24.1.0
Bottleneck==1.4.2
bqplot==0.12.43
branca==0.8.0
CacheControl==0.14.0
cachetools==5.5.0
catalogue==2.0.10
certifi==2024.8.30
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.4.0
chex==0.1.87
clarabel==0.9.0
click==8.1.7
click-default

The following code has been mainly generated as a response to the prompt:

"I would like to use pyeuropeana to download journals that can be used in the topic modeling workflow. Can you provide an example for that?"

However, the chatbot did not capture the changes introduced in the latest kiara version, because they were not clearly reflected in the documentation; the generated code therefore had to be corrected manually.

First of all, let's define a model for the configuration data.

*Simple prompt with corrections*

In [4]:
from kiara.models import KiaraModel
from pydantic import Field
from typing import Dict, Any

class DownloadEuropeanaJournalsConfig(KiaraModel):
    """Configuration values for the Europeana journal download module."""
    api_key: str = Field(description="API key for accessing Europeana.")
    search_query: str = Field(description="Query string to search for journals.")
    # The field is typed as str, so the default is a plain string rather than a Path object
    download_path: str = Field(description="Directory path where journals will be saved.", default="./journals")

*After providing a snippet of the pyeuropeana documentation and explicitly asking the chatbot to enter "Module building" mode*

Next, create a module that uses this configuration to perform its task.

**Warning**: *The Europeana API key used here is my personal one. You should register and get your own to use this colab file publicly.*

In [5]:
# Definition of the Europeana API key (this is mine, replace it with yours!): 'armedinguil'

from kiara.modules import KiaraModule
import pyeuropeana as europeana
import pyeuropeana.apis as apis
import pyeuropeana.utils as utils
import os

os.environ['EUROPEANA_API_KEY'] = 'armedinguil'


class DownloadEuropeanaJournals(KiaraModule):
    _module_type_name = "europeana_journals"
    config = DownloadEuropeanaJournalsConfig

    def __init__(self, config):
        super().__init__()
        # Keep the configuration on the instance so that run() does not
        # depend on a global variable named `config`
        self._download_config = config

    def create_inputs_schema(self):
        return {
            "api_key": {
                "type": "string",
                "description": "API key for accessing Europeana."
            },
            "search_query": {
                "type": "string",
                "description": "Query string to search for historical data."
            },
            "download_path": {
                "type": "string",
                "description": "Directory path where data will be saved."
            }
        }

    def create_outputs_schema(self):
        return {
            "table_output": {
                "type": "table"
            }
        }

    def download_journals(self, api_key, search_query, download_path):
        # pyeuropeana reads the key from the EUROPEANA_API_KEY environment
        # variable set above, so `api_key` is not passed on explicitly
        results = apis.iiif.search(
            query=search_query,
            profile='hits'
        )

        os.makedirs(download_path, exist_ok=True)

        record = None
        for item in results['items']:
            title = item['title'][0] if 'title' in item else 'unknown'
            # Use a separate name so the search results are not overwritten
            record = apis.record(item['id'])

            with open(os.path.join(download_path, f"{title}.txt"), 'w', encoding='utf-8') as file:
                file.write(str(record))

        # As in the original run, only the last downloaded record is returned
        return record

    def run(self) -> Dict[str, Any]:
        cfg = self._download_config
        data = self.download_journals(cfg.api_key, cfg.search_query, cfg.download_path)
        return data
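The module above names each file after the record title, and Europeana titles may contain characters such as `:` or `/` that are awkward or invalid in file names on some systems. A minimal sketch of a sanitizing step follows; `safe_filename` is a hypothetical helper, not part of kiara or pyeuropeana, and the replacement rules are an assumption.

```python
import re
import unicodedata


def safe_filename(title, max_length=100):
    """Return a filesystem-friendly version of a record title."""
    # Normalize compatibility characters (the ligature "ĳ" becomes "ij").
    text = unicodedata.normalize("NFKC", title)
    # Replace every run of characters other than word characters, dots
    # and hyphens with a single underscore.
    text = re.sub(r"[^\w.\-]+", "_", text)
    # Trim separators from both ends and cap the length.
    text = text.strip("._-")[:max_length]
    return text or "unknown"


print(safe_filename("De Tĳd : godsdienstig-staatkundig dagblad - 1877-12-18"))
```

The result could be passed to `os.path.join(download_path, ...)` in place of the raw title.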


After creating the module, we create a kiara instance; for this quick test, the module is instantiated and run directly.

In [6]:
from kiara.api import KiaraAPI

kiara = KiaraAPI.instance()

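# Define the configuration (no trailing commas after the values:
# a trailing comma would turn each value into a one-element tuple)

config = DownloadEuropeanaJournalsConfig
config.api_key = "armedinguil"
config.search_query = "immigrants"
config.download_path = "./test_journals"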

module = DownloadEuropeanaJournals(config)
result = module.run()

print(result)


{'apikey': 'armedinguil', 'success': True, 'statsDuration': 188, 'requestNumber': 999, 'object': {'about': '/9200359/BibliographicResource_3000115973110', 'aggregations': [{'about': '/aggregation/provider/9200359/BibliographicResource_3000115973110', 'edmDataProvider': {'def': ['http://data.europeana.eu/organization/1482250000001710507']}, 'edmIsShownBy': 'http://imageviewer.kb.nl/ImagingService/imagingService?id=ddd:010265195:mpeg21:p001:image', 'edmIsShownAt': 'http://kranten.delpher.nl/nl/view/index?image=ddd:010265195:mpeg21', 'edmObject': 'http://imageviewer.kb.nl/ImagingService/imagingService?id=ddd:010265195:mpeg21:p001:image&w=400', 'edmProvider': {'def': ['http://data.europeana.eu/organization/1482250000004516062']}, 'edmRights': {'def': ['http://creativecommons.org/publicdomain/mark/1.0/']}, 'edmUgc': 'false', 'hasView': ['http://imageviewer.kb.nl/ImagingService/imagingService?id=ddd:010265195:mpeg21:p002:image', 'http://imageviewer.kb.nl/ImagingService/imagingService?id=ddd:

Next, we define the PreprocessJournals module. This module preprocesses journal data (text) for topic modeling by tokenizing it.

In [7]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import requests
import re
import pyeuropeana.apis as apis

nltk.download('stopwords')

class PreprocessJournalsConfig(KiaraModel):
    journals: Dict[str, str] = Field(description="Dictionary with journal titles as keys and links to the stored journals as values.")

class PreprocessJournals(KiaraModule):
    """
    A Kiara module to preprocess journal annotation data for topic modeling.
    It fetches annotations from Europeana using the provided links.
    """
    _module_type_name = "preprocess_journals"
    config : PreprocessJournalsConfig
    def __init__(self, config):
        # Call the KiaraModule's initialization
        super().__init__()

    def clean_text(self, text: str) -> str:
        """
        Cleans the text by removing punctuation and non-alphabetic characters.
        """
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        return text.lower()

    def fetch_full_text(self, record_id: str) -> str:
        """
        Fetches the full text of the journal using its metadata (manifest and fulltext).
        """
        try:
            # Fetch the manifest for the journal using the correct record ID format
            manifest_data = apis.iiif.manifest(record_id)
            full_text = ""

            # Iterate over the pages to fetch their full text
            for page in manifest_data.get('sequences', [{}])[0].get('canvases', []):
                # Get page ID and fulltext ID from the metadata
                page_id = page.get('label', 1)  # Default to page 1 if not found
                fulltext_id = page.get('otherContent', [{}])[0].get('@id', '')

                if page_id and fulltext_id:
                    # Fetch the full text for this page (response is plain text, not JSON)
                    page_text = apis.iiif.fulltext(RECORD_ID=record_id, FULLTEXT_ID=fulltext_id)

                    # Append the plain text of the page to the full text
                    full_text += page_text + " "

            return full_text.strip()
        except Exception as e:
            print(f"Error fetching full text for {record_id}: {e}")
            return ""


    def create_inputs_schema(self) -> Dict[str, Any]:
        """
        Define the schema for the module inputs.
        """
        return {
            "journals": {
                "type": "dict",
                "description": "A dictionary with journal titles as keys and their Europeana record IDs (relative path) as values.",
                "required": True
            }
        }

    def create_outputs_schema(self) -> Dict[str, Any]:
        """
        Define the schema for the module outputs.
        """
        return {
            "tokenized_full_texts": {
                "type": "dict",
                "description": "A dictionary of journal titles with their tokenized full texts.",
                "required": True
            }
        }

    def run(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """
        Main method that runs the preprocessing on the journal full text.
        Fetches the full text and preprocesses it into tokenized text.
        """
        stop_words = set(stopwords.words('english'))
        tokenized_full_texts = {}

        # Retrieve the journals dictionary from the inputs
        journals = inputs.get('journals')

        # Iterate through each journal and process its full text
        for title, record_id in journals.items():
            # Reduce a full item URL (with or without a language segment such as "/en")
            # to the bare record ID, e.g. "/9200359/BibliographicResource_3000115973110"
            record_id = re.sub(r"^https://www\.europeana\.eu/(?:[a-z]{2}/)?item", "", record_id)

            full_text = self.fetch_full_text(record_id)
            if full_text:
                cleaned_text = self.clean_text(full_text)
                tokens = word_tokenize(cleaned_text)
                filtered_tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
                tokenized_full_texts[title] = filtered_tokens
            else:
                print(f"No full text found for {title}")

        # Return the tokenized full texts
        return {"tokenized_full_texts": tokenized_full_texts}



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
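To make the preprocessing steps of the module concrete, here is a small, dependency-free sketch of the same sequence: clean, tokenize, filter. It swaps NLTK's `word_tokenize` (which additionally needs NLTK's Punkt tokenizer data) for a plain whitespace split, and the stop-word set is a tiny illustrative stand-in for NLTK's English list.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "and", "of", "to", "in", "a", "is", "was", "for"}


def clean_text(text):
    """Keep ASCII letters and whitespace only, then lowercase (same pattern as the module)."""
    return re.sub(r"[^a-zA-Z\s]", "", text).lower()


def tokenize(text, stop_words=STOP_WORDS, min_length=3):
    """Whitespace tokenization followed by stop-word and length filtering."""
    words = clean_text(text).split()
    return [w for w in words if w not in stop_words and len(w) >= min_length]


print(tokenize("The immigrants arrived in 1877, and the city was full of news."))
# prints ['immigrants', 'arrived', 'city', 'full', 'news']
```

Note that the pattern `[^a-zA-Z\s]` also removes non-ASCII letters, so a title such as "De Tĳd" is reduced to "de td"; for non-English newspapers, both the pattern and the stop-word language may need adapting.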


The API for harvesting journals does not seem to work properly. As a workaround, a web scraper has been built that fetches the text automatically from the website. This is a cumbersome solution, to be replaced by the previous one as soon as it has been fixed.

In [8]:

import re
import requests
from bs4 import BeautifulSoup
from typing import Dict, Any
#from kiara import KiaraModule
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

# Download stopwords if not already downloaded
nltk.download('stopwords')

class PreprocessJournalsScraper(KiaraModule):
    """
    A Kiara module that scrapes the text of journals from Europeana webpages.
    It specifically looks for elements with the class "mirador92 mirador-third-party-html".
    """
    _module_type_name = "preprocess_journals_scraper"

    config : PreprocessJournalsConfig
    def __init__(self, config):
        # Call the KiaraModule's initialization
        super().__init__()

    def clean_text(self, text: str) -> str:
        """
        Cleans the text by removing punctuation and non-alphabetic characters.
        """
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        return text.lower()

    def scrape_journal_text(self, journal_url: str) -> str:
        """
        Scrapes the full text from elements with class 'mirador92 mirador-third-party-html'
        in the Europeana journal webpage.
        """
        try:
            # Fetch the webpage content
            response = requests.get(journal_url)
            print(str(response))
            if response.status_code == 200:
                # Parse the page content using BeautifulSoup
                soup = BeautifulSoup(response.content, 'html.parser')

                # Find elements with the class 'mirador92 mirador-third-party-html'
                text = ""
                elements = soup.find_all('span', class_="mirador92 mirador-third-party-html").count()
                print(str(elements))
                # Extract text from these elements
                for element in elements:
                    text += element.get_text(separator=" ") + " "
                print(text)
                return text.strip()
            else:
                print(f"Failed to fetch {journal_url}. Status code: {response.status_code}")
                return ""
        except Exception as e:
            print(f"Error scraping text from {journal_url}: {e}")
            return ""

    def create_inputs_schema(self) -> Dict[str, Any]:
        """
        Define the schema for the module inputs.
        """
        return {
            "journals": {
                "type": "dict",
                "description": "A dictionary with journal titles as keys and URLs to the Europeana journal pages as values.",
                "required": True
            }
        }

    def create_outputs_schema(self) -> Dict[str, Any]:
        """
        Define the schema for the module outputs.
        """
        return {
            "tokenized_full_texts": {
                "type": "dict",
                "description": "A dictionary of journal titles with their tokenized full texts after scraping.",
                "required": True
            }
        }

    def run(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """
        Main method that runs the preprocessing on the journal webpages.
        Scrapes the full text and preprocesses it into tokenized text.
        """
        stop_words = set(stopwords.words('english'))
        tokenized_full_texts = {}

        # Retrieve the journals dictionary from the inputs
        journals = inputs.get('journals')

        # Iterate through each journal and scrape its full text
        for title, journal_url in journals.items():
            full_text = self.scrape_journal_text(journal_url)
            if full_text:
                cleaned_text = self.clean_text(full_text)
                tokens = word_tokenize(cleaned_text)
                filtered_tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
                tokenized_full_texts[title] = filtered_tokens
            else:
                print(f"No full text found for {title}")

        # Return the tokenized full texts
        return {"tokenized_full_texts": tokenized_full_texts}


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:

# Example input: A dictionary of journal titles and their Europeana URLs
journals_input = {
    "De Tĳd : godsdienstig-staatkundig dagblad - 1877-12-18": "https://www.europeana.eu/en/item/9200359/BibliographicResource_3000115973110"}

config = PreprocessJournalsConfig
config.journals = journals_input
# Create an instance of the PreprocessJournalsScraper module
preprocess_module = PreprocessJournalsScraper(config)
inputs = {
    "journals": journals_input
}
# Run the module with the input
result = preprocess_module.run(inputs)

# Output the tokenized annotations
tokenized_annotations = result.get("tokenized_annotations")
print(tokenized_annotations)


<Response [200]>
Error scraping text from https://www.europeana.eu/en/item/9200359/BibliographicResource_3000115973110: ResultSet.count() takes exactly one argument (0 given)
No full text found for De Tĳd : godsdienstig-staatkundig dagblad - 1877-12-18
None


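This alternative fails as well. The error reported above comes from calling `.count()` without arguments on the `ResultSet` returned by `find_all` (the number of matches is given by `len(...)`), and the final `None` is due to the cell reading the key `tokenized_annotations` instead of `tokenized_full_texts`. Even with these slips corrected, the approach cannot work, because the annotation text is written into the page by a JavaScript event and is therefore absent from the static HTML; such dynamically rendered pages can only be scraped with a browser-automation tool such as Selenium. Once the text is available, the next step is to define the pipeline that connects the two modules, one for downloading the journals and one for preprocessing their text (the workflow method is deprecated; pipelines are to be used instead).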
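The extraction that the scraper is aiming for can be illustrated without network access. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup, so that it runs without extra packages; both HTML strings are made up for illustration, the second one imitating an application shell whose content would only be filled in by JavaScript.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text inside <span> elements that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside a matching span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")   # start a new text entry

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def extract_span_texts(html, css_class="mirador-third-party-html"):
    """Return the non-empty texts of all spans with the given class."""
    parser = SpanTextCollector(css_class)
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.texts if t.strip()]


# Made-up markup: a rendered page versus the static shell of a JavaScript app.
rendered = '<div><span class="mirador92 mirador-third-party-html">De Tijd</span></div>'
static_shell = '<div id="__nuxt"><div id="__layout"></div></div>'

print(extract_span_texts(rendered))
print(len(extract_span_texts(static_shell)))
```

On the static shell the list is simply empty, which is what a `requests`-based scraper would face on a page whose text is inserted at runtime.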

In [10]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.26.1-py3-none-any.whl.metadata (7.1 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.27.0-py3-none-any.whl.metadata (8.6 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Downloading selenium-4.26.1-py3-none-any.whl (9.7 MB)
   9.7/9.7 MB 39.2 MB/s eta 0:00:00
Downloading trio-0.27.0-py3-none-any.whl (481 kB)
   481.7/481.7 kB 25.0 MB/s eta 0:00:00
Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Downloading wsproto-1.2.0-py3-none-any.w

In [11]:
!apt-get update
!apt-get install -y chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:7 https://ppa.l

In [29]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def scrape_europeana_text(url):
  options = Options()
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  text_content = ""
  driver = webdriver.Chrome(options=options)
  # Initialize the driver (make sure you have the appropriate driver installed)
  try:
    # Navigate to page
    print("Navigating to URL...")
    driver.get(url)
    time.sleep(5)
    print("Page loaded. Title:", driver.title)

    # Try different XPath patterns
    xpath_patterns = [
        "//span[contains(@class, 'mirador84')]",
        "//span[contains(@class, 'mirador-third-party-html')]",
        "//span[contains(@class, 'mirador')]",
        "//li//span[contains(@class, 'mirador84')]",
        "//div//span[contains(@class, 'mirador84')]",
        "//span",  # Most general pattern
    ]

    for xpath in xpath_patterns:
        spans = driver.find_elements(By.XPATH, xpath)
        print(f"\nTrying XPath: {xpath}")
        print(f"Found {len(spans)} elements")

        if spans:
            print("Sample texts found:")
            for i, span in enumerate(spans[:3]):  # Show first 3 examples
                print(f"Span {i+1}: {span.text[:100]}")

    # Save page source for inspection
    with open('page_source.html', 'w', encoding='utf-8') as f:
        f.write(driver.page_source)
    print("\nPage source saved to page_source.html")

    # Take screenshot for visual reference
    driver.save_screenshot("page.png")
    print("Screenshot saved as page.png")

    # Print some example HTML structure
    print("\nExample of page structure:")
    print(driver.find_element(By.TAG_NAME, "body").get_attribute('innerHTML')[:500])

    xpath = "//span[contains(@class, 'mirador-third-party-html')]"
    spans = driver.find_elements(By.XPATH, xpath)

    print(f"Found {len(spans)} spans")

    # Extract text from each span
    for i, span in enumerate(spans, 1):
        text = span.text.strip()
        if text:
            print(f"Processing span {i}: {text[:50]}...")  # Show first 50 chars
            text_content += text + "\n"

    # Save the content
    if text_content:
        with open('journal_text.txt', 'w', encoding='utf-8') as f:
            f.write(text_content)
        print("\nText saved to journal_text.txt")
    else:
        print("No text was extracted")

  except Exception as e:
      print(f"An error occurred: {str(e)}")

  finally:
      driver.quit()
      print("\nBrowser closed")
# URL to scrape
url = "https://www.europeana.eu/en/item/9200359/BibliographicResource_3000115973110"

scrape_europeana_text(url)

Navigating to URL...
Page loaded. Title: De Tĳd : godsdienstig-staatkundig dagblad - 1877-12-18 | Europeana

Trying XPath: //span[contains(@class, 'mirador84')]
Found 1 elements
Sample texts found:
Span 1: 

Trying XPath: //span[contains(@class, 'mirador-third-party-html')]
Found 564 elements
Sample texts found:
Span 1: 
Span 2: 
Span 3: 

Trying XPath: //span[contains(@class, 'mirador')]
Found 575 elements
Sample texts found:
Span 1: 
Span 2: 
Span 3: 

Trying XPath: //li//span[contains(@class, 'mirador84')]
Found 0 elements

Trying XPath: //div//span[contains(@class, 'mirador84')]
Found 1 elements
Sample texts found:
Span 1: 

Trying XPath: //span
Found 1803 elements
Sample texts found:
Span 1: 
Span 2: 
Span 3: 

Page source saved to page_source.html
Screenshot saved as page.png

Example of page structure:

    <div id="__nuxt"><!----><div id="__layout"><div data-v-3ca9fe1f=""><div data-v-332eadba="" data-v-3ca9fe1f="" id="announcer" aria-live="polite" class="announcer" data-qa="vue
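In the sample output above, the first spans printed for each pattern are empty. One possible reason is that Selenium's `.text` only returns text that is visibly rendered, whereas the `textContent` attribute is read from the DOM regardless of visibility. The sketch below is an assumption-laden variant of the scraper, not the method used in this notebook: it replaces the fixed `time.sleep(5)` with an explicit wait and reads `textContent`. The function names and the 30-second timeout are illustrative, and Selenium is imported inside the function so that merely defining it does not require a browser.

```python
def join_nonempty(texts):
    """Join the non-empty, stripped strings with newlines."""
    return "\n".join(t.strip() for t in texts if t and t.strip())


def collect_text_content(url, css_selector="span.mirador-third-party-html", timeout=30):
    """Load a page in headless Chrome and return the textContent of matching elements."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one matching element is present in the DOM.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        # Read the DOM text of each element, whether or not it is visible.
        raw = [element.get_attribute("textContent") or "" for element in elements]
    finally:
        driver.quit()
    return join_nonempty(raw)
```

Calling `collect_text_content(url)` with the item URL used above would need the same Chrome and chromedriver setup as the previous cells.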

This seems a good point at which to stop our attempt at creating modules for journal harvesting with an AI assistant (in this case, the Kiara Module Builder built on GPT-4o).
The journal has been successfully stored in the journal_text.txt file and can now be further processed with topic modelling or other methods of computational linguistics, depending on the user's requirements, as explained in this [tutorial](https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/blob/develop/docs/jupyter/kiara_topic_modelling.ipynb).
Many other journals can be downloaded from Europeana; their titles, countries of origin and further metadata can be stored in a queryable .csv file, as described in the tutorial for [DHBenelux 2023](https://github.com/DHARPA-Project/kiara_plugin.dh_tagung_2023/blob/main/docs/notebooks/Hello_kiara.ipynb).
In this way, the presented workflow extends the one presented there.

However, the initial focus of this experiment was to explore the utility of AI assistants in building new modules through the steps identified by Mariella:

1. Inputs and Outputs identification
2. Inputs and Outputs data types
3. Kiara module creation
4. Kiara module optimization
5. Kiara module test writing


After obtaining encouraging, yet not properly functioning, results, the focus shifted to polishing the scripts in order to produce practically useful code. From that point on, the queries to the language model were no longer recorded; moreover, they did not require our Module Builder.

This experiment was valuable for identifying misalignments between the documentation and the codebase.

***Concluding remarks***

* The system can be enhanced with logs of the user interaction, in which the user marks both successful and unsuccessful code generations and, for the latter, states the specific reasons and the corrections;
* The question-answering workflow does not accompany the user to the final solution with the granularity specified in the builder's initial settings. After the first prompt, the system directly generated an exhaustive answer, although one organized into very clean and reasonable bullet points;
* Retrieval augmentation can be improved by enriching the knowledge base with comments, structure, and "meta-documentation";
* From the user side, it must be explicitly stated that extracting already existing information has to be prioritized. Stating at the very beginning that the "module building" mode needs to be entered has also significantly improved the quality of the generation, initiating proactive behavior from the chatbot;
* Considering the lack of synchronization between the documentation and the codebase, the prompts could be shaped so that they give priority to the codebase, which is usually more up to date.
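The last two remarks can be sketched as a small prompt-assembly step in which retrieved snippets are ordered by source, codebase first. Everything here is hypothetical: the `Snippet` structure, the source labels, the instruction wording and the character budget are assumptions for illustration, not the configuration of the Kiara Module Builder.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str   # "codebase" or "documentation" (illustrative labels)
    text: str


# Lower number means higher priority; unknown sources go last.
PRIORITY = {"codebase": 0, "documentation": 1}


def build_prompt(question, snippets, max_chars=2000):
    """Assemble a prompt in which codebase snippets precede documentation snippets."""
    ordered = sorted(snippets, key=lambda s: PRIORITY.get(s.source, len(PRIORITY)))
    lines = [
        "You are in module building mode.",
        "Reuse existing kiara code where possible.",
        "If code and documentation disagree, follow the code.",
        "",
    ]
    used = 0
    for snippet in ordered:
        if used + len(snippet.text) > max_chars:
            break   # stop once the context budget is exhausted
        lines.append(f"[{snippet.source}] {snippet.text}")
        used += len(snippet.text)
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


example = build_prompt(
    "How do I declare the inputs of a module?",
    [
        Snippet("documentation", "Modules declare inputs in a schema."),
        Snippet("codebase", "def create_inputs_schema(self): ..."),
    ],
)
print(example)
```

Because the sort is stable, snippets from the same source keep the order in which the retriever returned them.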




**Exploration of Fine-tuning approach**

The presented workflows have been useful for assessing the utility of LLM-powered chatbots in code generation. However, it would be desirable for this first step to be followed by the integration of the question-answering task into kiara itself, so that the system can work organically without third-party tools such as OpenAI's ChatGPT. The LLMs against which the generation will be tested are to be collected from [HuggingFace](https://huggingface.co/models?pipeline_tag=document-question-answering&sort=trending) among the well-known Mistral, LLaMA, Alpaca, BLOOM and similar models. For the time being, this step is left aside in order to correct the knowledge base according to the pitfalls identified in the previous assessment.

*Installing libraries*

In [None]:
!pip install transformers scikit-learn torch

Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
  Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Collectin

*Preprocessing and Vectorizing the Documentation*

In [None]:
import os
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    text = text.lower()
    text = ''.join(char for char in text if char.isalnum() or char.isspace())
    return text

def vectorize_text(text):
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([text])
    feature_names = vectorizer.get_feature_names_out()
    dense = vectors.todense()
    denselist = dense.tolist()
    return denselist, feature_names

# Upload your .txt file to Colab
from google.colab import files
uploaded = files.upload()

# Assuming the uploaded file is named 'document.txt'
file_path = list(uploaded.keys())[0]

preprocessed_text = preprocess_text(file_path)
vectors, feature_names = vectorize_text(preprocessed_text)

print("Feature names:", feature_names[:10])  # Print first 10 feature names for verification
print("Vectors:", vectors[:10])  # Only one document was vectorized, so this prints its full TF-IDF vector


Saving DHARPA_Project_code - documentation.txt to DHARPA_Project_code - documentation.txt
Feature names: ['0000' '0009per' '001' '0010' '0010si' '0011' '0011a' '0011il' '0011per'
 '0011un']
Vectors: [[0.00012652244863409635, 6.326122431704818e-05, 0.0008856571404386743, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 0.00012652244863409635, 6.326122431704818e-05, 6.326122431704818e-05, 0.00012652244863409635, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 0.0001897836729511445, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 0.00012652244863409635, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.326122431704818e-05, 6.32612243170
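
Because the whole documentation is vectorized as a single document, the IDF factor is identical for every term, and the weights above reduce to normalized term frequencies (hence the many repeated values). A minimal sketch of an alternative follows, assuming the goal is to later retrieve the passages most relevant to a question: the text is split into chunks, each chunk is vectorized as its own document, and chunks are ranked by cosine similarity to a query. The helper names `split_into_chunks` and `rank_chunks`, the chunk size, and the toy sentences are illustrative assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def split_into_chunks(text, chunk_size=50):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def rank_chunks(chunks, query, top_k=3):
    """Return (chunk_index, similarity) pairs for the chunks closest to the query."""
    vectorizer = TfidfVectorizer()
    chunk_vectors = vectorizer.fit_transform(chunks)  # one row per chunk
    query_vector = vectorizer.transform([query])      # same vocabulary as the chunks
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy stand-in for the documentation (hypothetical sentences, for illustration only)
toy_chunks = [
    "kiara modules declare typed inputs and outputs",
    "the europeana api returns metadata records for cultural heritage items",
    "topic modelling groups documents by recurring vocabulary",
]
best_index, best_score = rank_chunks(toy_chunks, "europeana api records", top_k=1)[0]
print("Best matching chunk:", toy_chunks[best_index])
```

On the real file, `split_into_chunks(preprocessed_text)` would produce the chunks to pass to `rank_chunks`; the chunk size is a tuning choice rather than a recommended value.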

***Generating Code Using LLAMA, GPT-3.5, Bloom, Mistral, etc.***

# Appendix

Snippets of code to save the entire kiara codebase into a single txt file:

# Clone the main kiara repository
git clone https://github.com/DHARPA-Project/kiara.git

# Clone the kiara_plugin.network_analysis repository
git clone https://github.com/DHARPA-Project/kiara_plugin.network_analysis.git

# Clone the NetworkAnalysis repository
git clone https://github.com/DHARPA-Project/NetworkAnalysis.git

# Clone the TopicModelling- repository
git clone https://github.com/DHARPA-Project/TopicModelling-.git

# Clone the jupyterlab-extension-example repository
git clone https://github.com/DHARPA-Project/jupyterlab-extension-example.git

# Clone the asciinet repository
git clone https://github.com/DHARPA-Project/asciinet.git


# Navigate to the kiara repository
cd kiara
# List all files
find . > ../kiara_files.txt
# Return to the parent directory
cd ..

# Repeat for each repository
cd kiara_plugin.network_analysis
find . > ../kiara_plugin_network_analysis_files.txt
cd ..

cd NetworkAnalysis
find . > ../NetworkAnalysis_files.txt
cd ..

cd TopicModelling-
find . > ../TopicModelling_files.txt
cd ..

cd jupyterlab-extension-example
find . > ../jupyterlab_extension_example_files.txt
cd ..

cd asciinet
find . > ../asciinet_files.txt
cd ..


# Combine all listings into a single file
cat kiara_files.txt kiara_plugin_network_analysis_files.txt NetworkAnalysis_files.txt TopicModelling_files.txt jupyterlab_extension_example_files.txt asciinet_files.txt > DHARPA_Project_files.txt
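
For readers working inside the notebook rather than a shell, the listing step above can be approximated in Python. This is a minimal sketch, not an exact equivalent of `find .`: it records files only (no directories), skips `.git` folders, and ignores repositories that have not been cloned. The function name `list_repo_files` is a hypothetical helper introduced here for illustration.

```python
from pathlib import Path


def list_repo_files(repo_dirs, output_file):
    """Write the path of every file under each repository to `output_file`."""
    lines = []
    for repo in repo_dirs:
        repo_path = Path(repo)
        if not repo_path.is_dir():
            continue  # skip repositories that were not cloned
        for path in sorted(repo_path.rglob("*")):
            if ".git" in path.parts:
                continue  # leave out version-control internals
            if path.is_file():
                lines.append(path.as_posix())
    # One path per line, mirroring the combined listing produced by `cat`
    Path(output_file).write_text("".join(f"{line}\n" for line in lines), encoding="utf-8")
    return lines
```

Calling `list_repo_files(["kiara", "kiara_plugin.network_analysis", "NetworkAnalysis", "TopicModelling-", "jupyterlab-extension-example", "asciinet"], "DHARPA_Project_files.txt")` from the directory containing the clones would write a single combined listing.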

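Then run the following cell, which concatenates the contents of the source and documentation files into a single text file:
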
In [None]:
import os

# List of repository directories
repos = [
    "kiara",
    "kiara_plugin.network_analysis",
    "NetworkAnalysis",
    "TopicModelling-",
    "jupyterlab-extension-example",
    "asciinet"
]

# Output file
output_file = "DHARPA_Project_code.txt"

# File extensions to include in the combined file
extensions = ('.py', '.md', '.txt', '.sh', '.json', '.js', '.yml')

with open(output_file, 'w', encoding='utf-8') as outfile:
    for repo in repos:
        for root, _, files in os.walk(repo):
            for file in files:
                if file.endswith(extensions):
                    file_path = os.path.join(root, file)
                    # Mark where each file starts with a header line holding its path
                    outfile.write(f"\n\n# {file_path}\n")
                    # Read as UTF-8 and replace undecodable bytes instead of failing
                    with open(file_path, 'r', encoding='utf-8', errors='replace') as infile:
                        outfile.write(infile.read())