<a href="https://colab.research.google.com/github/JackGraymer/Advanced-GenAI/blob/main/1_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Generative Artificial Intelligence
**Project - Designing a RAG-Based Q&A System for News Retrieval**

**Authors:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan (Group 5)


# Step 1 - Data preparation

**Contribution:** ....

**Goal of this step:** ....

# Loading, Parsing, and Cleaning HTML Files (5 Points)

## 1. Setup of the environment

Below the necessary libraries are installed and loaded into the environment.

In [None]:
!pip install -q beautifulsoup4==4.13.4
!pip install -q docling==2.31.0
from bs4 import BeautifulSoup, Comment
import docling
from docling.document_converter import DocumentConverter, HTMLFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AcceleratorDevice, AcceleratorOptions, PipelineOptions


In [1]:
import os
import re
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
import tempfile

In [2]:
# Set the seed for consistent results
seed_value = 2138247234
random.seed(seed_value)
np.random.seed(seed_value)
os.environ['PYTHONHASHSEED'] = str(seed_value)

Below we mount a shared Google Drive folder as data storage and define the base path of the folder used during the runtime.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
base_folder = '/content/drive/MyDrive/AdvGenAI'

## 2. Loading the raw data

### Loading

We walk through the subdirectories of the data folder and read each HTML file. For every file we store its content together with the file name and the file path, so that the subfolder it came from is preserved.

In [None]:
# Definition of data folder
data_folder = os.path.join(base_folder, 'data')

In [None]:
# List to hold the dictionaries
data = []

# Walk through all directories and subdirectories
for root, dirs, files in os.walk(data_folder):
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(root, file)

            # Read the content of the HTML file
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            # Add a dictionary to the list
            data.append({
                'folder_path': root,
                'file_name': file,
                'full_path': file_path,
                'html_content': content
            })

# Convert to DataFrame
df = pd.DataFrame(data)

# Optionally save DataFrame, e.g. to CSV or pickle for later use
# df.to_csv('html_files_content.csv', index=False)
# df.to_pickle('html_files_content.pkl')

# Show first rows to verify
print(df.head())


In [None]:
pd.set_option('display.max_colwidth', 50)
df.head()

Unnamed: 0,folder_path,file_name,full_path,html_content
0,/content/drive/MyDrive/AdvGenAI/data/de_intern...,der-r-pionier1.html,/content/drive/MyDrive/AdvGenAI/data/de_intern...,"<div class=""text-image cq-dd-image"">\n<p>Währe..."
1,/content/drive/MyDrive/AdvGenAI/data/de_intern...,web-of-science-alles-neu-macht-der--januar-.html,/content/drive/MyDrive/AdvGenAI/data/de_intern...,"<div class=""text-image cq-dd-image"">\n<p><a cl..."
2,/content/drive/MyDrive/AdvGenAI/data/de_intern...,swiss-life-sciences-2014-experten-gesucht.html,/content/drive/MyDrive/AdvGenAI/data/de_intern...,"<div class=""text-image cq-dd-image"">\n<h2>Staf..."
3,/content/drive/MyDrive/AdvGenAI/data/de_intern...,mendeley-literaturverwaltung.html,/content/drive/MyDrive/AdvGenAI/data/de_intern...,"<div class=""text-image cq-dd-image"">\n<p><a cl..."
4,/content/drive/MyDrive/AdvGenAI/data/de_intern...,mehr-orientierung.html,/content/drive/MyDrive/AdvGenAI/data/de_intern...,"<div class=""text-image cq-dd-image"">\n<p>Das S..."


### Checking completeness of loading

**Number of files**

Below we compare the number of documents collected into the DataFrame with the number of files in the data folder.

The check revealed three files that were not part of the DataFrame. On inspection these turned out to be `.DS_Store` files (macOS folder metadata), so it is correct that they were not included.

In [None]:
# Dataframe
print(f"Number of files in the DataFrame: {len(df)}")

Number of files in the DataFrame: 4390


In [None]:
# Files in Data folder
print("Number of files in the data folder:")
!find "$data_folder" -type f | wc -l

Number of files in the data folder:
4393


In [None]:
!find "$data_folder" -type f | sort > folder_files.txt
df['full_path'].sort_values().to_csv('df_files.txt', index=False, header=False)
!sort folder_files.txt -o folder_files.txt
!sort df_files.txt -o df_files.txt
!comm -23 folder_files.txt df_files.txt

/content/drive/MyDrive/AdvGenAI/data/de_internal/2013/.DS_Store
/content/drive/MyDrive/AdvGenAI/data/de_internal/2015/.DS_Store
/content/drive/MyDrive/AdvGenAI/data/de_internal/2024/.DS_Store
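
The same comparison can also be made without shell tools. The following is a minimal sketch, assuming a hypothetical helper `files_missing_from_df` (not part of the notebook) that walks a folder and returns every path that is absent from a given list of loaded paths. It is demonstrated on a temporary folder so that it runs on its own.

```python
import os
import tempfile


def files_missing_from_df(data_folder, df_paths):
    """Return sorted paths found on disk but absent from df_paths (hypothetical helper)."""
    on_disk = {
        os.path.join(root, name)
        for root, _dirs, files in os.walk(data_folder)
        for name in files
    }
    return sorted(on_disk - set(df_paths))


# Self-contained demonstration on a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.html", "b.html", ".DS_Store"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("<p>demo</p>")
    loaded = [os.path.join(tmp, "a.html"), os.path.join(tmp, "b.html")]
    missing = files_missing_from_df(tmp, loaded)
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)
```

In the notebook the call would be `files_missing_from_df(data_folder, df['full_path'])`.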


**Checking for empty files**

Below we print out the rows of the dataframe with empty contents.

In [None]:
df[df['html_content'].isna()].head(10)

Unnamed: 0,folder_path,file_name,full_path,html_content
872,/content/drive/MyDrive/AdvGenAI/data/de_internal/2023/06,neue-indesign-vorlagen-fuer-mehr-barrierefreiheit.html,/content/drive/MyDrive/AdvGenAI/data/de_internal/2023/06/neue-indesign-vorlagen-fuer-mehr-barrierefreiheit.html,
893,/content/drive/MyDrive/AdvGenAI/data/de_internal/2024/02,call-2024-foerderprogramm-internationale-kooperation-in-der-lehre.html,/content/drive/MyDrive/AdvGenAI/data/de_internal/2024/02/call-2024-foerderprogramm-internationale-kooperation-in-der-lehre.html,
1212,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2021/08,einstein-quiz.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2021/08/einstein-quiz.html,
1847,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2016/08,pp_pitch_elgar_fleisch.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2016/08/pp_pitch_elgar_fleisch.html,
2045,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/10,annette-oxenius-erhaelt-cloetta-preis.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/10/annette-oxenius-erhaelt-cloetta-preis.html,
2049,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/10,kleider-virtuell-anprobieren.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/10/kleider-virtuell-anprobieren.html,
2050,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/10,neue-daten-sprechen-fuer-magma-auf-dem-mars.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/10/neue-daten-sprechen-fuer-magma-auf-dem-mars.html,
2080,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/11,proteinformen-zeigen-parkinson-krankheit-an.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/11/proteinformen-zeigen-parkinson-krankheit-an.html,
2083,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/11,kunst-aus-dem-computer.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/11/kunst-aus-dem-computer.html,
2090,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/09,elektroflugzeug-e-sling-hebt-ab.html,/content/drive/MyDrive/AdvGenAI/data/de_news_events/2022/09/elektroflugzeug-e-sling-hebt-ab.html,


After checking the original data source, we concluded that these files are genuinely empty, so the missing content is not caused by a problem in the loading process. We therefore exclude these rows from the DataFrame.

In [None]:
df = df[~df['html_content'].isna()].copy()
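
Note that `isna()` only flags an empty file once the data has gone through a CSV round trip, because `pd.read_csv` reads empty fields as `NaN` by default. Directly after `f.read()`, an empty file is an empty string instead. The following sketch uses invented demo data to show a filter that covers both cases as well as whitespace-only content.

```python
import pandas as pd

# Invented demo data: one real document and three kinds of "empty" content
demo = pd.DataFrame({
    "file_name": ["a.html", "b.html", "c.html", "d.html"],
    "html_content": ["<p>text</p>", "", None, "  \n"],
})

# Treat NaN/None, empty strings, and whitespace-only strings as empty
is_empty = demo["html_content"].isna() | (
    demo["html_content"].fillna("").str.strip() == ""
)
non_empty = demo[~is_empty].copy()

print(non_empty["file_name"].tolist())
```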

### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-01-raw-data.csv'), index=False)

## 3. Parsing and cleaning the HTML files

### Loading the data from storage

In [None]:
# Load csv from Google Drive Storage to Dataframe
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-01-raw-data.csv'))

In [None]:
pd.set_option('display.max_colwidth', 300)
df[["html_content"]].head(5)

Unnamed: 0,html_content
0,"<div class=""text-image cq-dd-image"">\n<p>Während andere in den Sechziger-Jahren Lokführer werden wollten, war für Martin Mächler schon immer klar: er würde Forscher werden. Das änderte sich auch auf dem Gymnasium nicht. Wenn er sich nicht gerade mit Mathematik beschäftigte, tüftelte er an Comput..."
1,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Ueber-uns/Aktuell/Web-of-Science-Alles-neu-macht-der-Januar"">Weitere Informationen</a><br/> </p>\n</div>\n<div class=""text-image cq-dd-image"">\n<h2>Staffnet</h2>\n<p>Das <a class=""eth-link"" href=""htt..."
2,"<div class=""text-image cq-dd-image"">\n<h2>Staffnet</h2>\n<p>Das <a class=""eth-link"" href=""https://ethz.ch/services/de.html"">Info-Portal</a> für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Zürich.</p>\n</div>\n<div class=""text-image cq-dd-image"">\n<h2>Newslett..."
3,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Dienstleistungen/Schulungen-Tutorials-Fuehrungen/Mendeley-Literaturverwaltung"" title=""Weitere Informationen auf der Website der ETH-Bibliothek"">Weitere Informationen</a><br/> </p>\n</div>\n<div class..."
4,"<div class=""text-image cq-dd-image"">\n<p>Das Studierendenleben an der ETH dreht sich momentan nur um eines: Die anstehenden Prüfungen. Während andere sich in den Bergen beim Skifahren vergnügen, sind die Bibliotheken an der ETH bis spät am Abend zum Brechen voll, im Stundentakt begegnet man auf ..."


### BeautifulSoup

#### Definition of cleaning function

Below, a function is defined that cleans the stored HTML strings using `BeautifulSoup`. It extracts the title and the main text of each document and removes elements that are not of interest for the further analysis (for example style and navigation elements).

In [None]:
def clean_html(html_content):
    from bs4 import BeautifulSoup, Comment
    import re

    soup = BeautifulSoup(html_content, 'html.parser')

    # Title extraction
    title = soup.title.get_text(strip=True) if soup.title else ''

    # Remove unwanted elements
    for el in soup(['script', 'style', 'header', 'footer', 'nav', 'iframe', 'meta', 'link']):
        el.decompose()
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Replace <br> tags with a newline (find_all expects the tag name, not the markup)
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Get the content from body if exists
    content = soup.body or soup

    # get_text with separator
    clean_text = content.get_text(separator='\n\n').strip()

    # Post-process: Collapse excessive blank lines
    clean_text = re.sub(r'\n{3,}', '\n\n', clean_text)

    return title, clean_text

#### Application of cleaning function on subset

We first try the function on a small random subset of the data. Note that the function uses `\n\n` as the separator between text elements.

In [None]:
# create a subset of the dataframe for testing
df_test = df.sample(n=5).copy()

In [None]:
df_test['title'], df_test['clean_content'] = zip(*df_test['html_content'].apply(clean_html))

In [None]:
pd.set_option('display.max_colwidth', 150)
df_test[['html_content', 'title', 'clean_content']].head()

Unnamed: 0,html_content,title,clean_content
4040,"<div class=""text-image cq-dd-image"">\n<p>Should he study history or physics? As a teenager in Athens in the 1970s, Konstantinos Boulouchos was int...",,"Should he study history or physics? As a teenager in Athens in the 1970s, Konstantinos Boulouchos was interested in so many things. But while the ..."
2957,"<div class=""text-image cq-dd-image"">\n<p>The ETH Library is the host and organiser of this year's Fall Seminar of the <a class=""eth-link"" href=""ht...",,The ETH Library is the host and organiser of this year's Fall Seminar of the \n\nexternal page\n\nInternational Association of University Librarie...
3133,"<div class=""text-image cq-dd-image"">\n<p>A deuteron is a very simple atomic nucleus made up of just one proton and one neutron — that is, one each...",,"A deuteron is a very simple atomic nucleus made up of just one proton and one neutron — that is, one each of the two nuclear building blocks. An i..."
1462,"<div class=""text-image cq-dd-image"">\n<p>Gestützt auf die Resultate der <a class=""eth-link"" href=""/de/news-und-veranstaltungen/eth-news/news/2017/...",,"Gestützt auf die Resultate der \n\nAdministrativuntersuchung\n\n, die am 25. Oktober 2017 eingeleitet und im Oktober 2018 abgeschlossen wurde, lei..."
1230,"<div class=""text-image cq-dd-image"">\n<p>Bald ist Schluss. Schluss mit der Professur an der ETH, mit Forschungsprojekten und dem Unterricht. Ernst...",,"Bald ist Schluss. Schluss mit der Professur an der ETH, mit Forschungsprojekten und dem Unterricht. Ernst Hafen hat sich gut mit dem Gedanken ange..."


Below we print out the HTML and the cleaned content for each document for comparison. For the cleaned content the double newlines are replaced by `\n---PARAGRAPH BREAK---\n` for better readability.

In [None]:
from IPython.display import HTML
for idx, row in df_test.iterrows():
    print("-" * 100)
    print(f"Row {idx}:")
    display(HTML(row['html_content']))
    # Display the cleaned text
    print("\nCleaned Text:\n")
    print(row['clean_content'].replace('\n\n', '\n---PARAGRAPH BREAK---\n'))
    print(100 * "-")
    print("\n")

----------------------------------------------------------------------------------------------------
Row 4040:



Cleaned Text:

Should he study history or physics? As a teenager in Athens in the 1970s, Konstantinos Boulouchos was interested in so many things. But while the humanities seemed a bit of a dead end, he was told a physics degree was only good for becoming a teacher, an idea he didn’t quite like either. “And today I find nothing more important than educating young people,” laughs Boulouchos.
---PARAGRAPH BREAK---
The ETH professor is full of enthusiasm when he talks about his students. As a young professor, he says, he hadn’t yet recognised the importance of open and good communication. Today, he knows that everyone has their own story. Those who are absorbed by their own issues, for example, might not be able to perform as well. It’s important to talk about things, because this is the only way to provide effective support. He goes on to say that his students shaped him and helped him to develop – and he hopes he did the same for them. Judging by his former students’ numerous awards an


Cleaned Text:

The ETH Library is the host and organiser of this year's Fall Seminar of the 
---PARAGRAPH BREAK---
external page
---PARAGRAPH BREAK---
International Association of University Libraries
---PARAGRAPH BREAK---
call_made
---PARAGRAPH BREAK---
 (IATUL). For once, the event is also open to non-members of IATUL! 
---PARAGRAPH BREAK---
Please save the date now: 13–15 December 2022.
---PARAGRAPH BREAK---
Under the title "Breaking new ground: scholarly communication and libraries", speakers from libraries, science and publishing houses will shed light on the following topics over the course of three days:
---PARAGRAPH BREAK---
scholarly communication,
---PARAGRAPH BREAK---
 
---PARAGRAPH BREAK---
metrics (bibliometrics and altmetrics),
---PARAGRAPH BREAK---
 
---PARAGRAPH BREAK---
awareness in science.
---PARAGRAPH BREAK---
 
---PARAGRAPH BREAK---
We are pleased to offer you a perfect background and stage for the IATUL Fall Seminar in the surroundings of ETH Zurich and serve as 


Cleaned Text:

A deuteron is a very simple atomic nucleus made up of just one proton and one neutron — that is, one each of the two nuclear building blocks. An international research collaboration, working at the Paul Scherrer Institute PSI, has measured the deuteron more accurately than ever before. The value they obtained for the radius of the deuteron does not, however, correspond to the measurements of other research groups but instead shows a significantly smaller value.
---PARAGRAPH BREAK---
In spite of this contradiction, there is also an agreement: In 2010 the same research group reported on the measurement of individual protons by means of the same method. Then, as well, the measurement clearly showed that the proton is smaller than had been thought to date. Since then, the research community has referred to this situation as “
---PARAGRAPH BREAK---
external page
---PARAGRAPH BREAK---
the proton radius puzzle
---PARAGRAPH BREAK---
call_made
---PARAGRAPH BREAK---
.”  A  furthe


Cleaned Text:

Gestützt auf die Resultate der 
---PARAGRAPH BREAK---
Administrativuntersuchung
---PARAGRAPH BREAK---
, die am 25. Oktober 2017 eingeleitet und im Oktober 2018 abgeschlossen wurde, leitet der Präsident der ETH Zürich das Kündigungsverfahren ein. Die von einem unabhängigen externen Experten durchgeführte Administrativuntersuchung hat schwerwiegendes pflichtwidriges Verhalten über einen längeren Zeitraum hinweg festgestellt. Der Untersuchungsführer empfiehlt eine Auflösung des Arbeitsverhältnisses. Deshalb wird eine Kommission zur Prüfung der Angemessenheit der Kündigung eingesetzt, wie dies 
---PARAGRAPH BREAK---
externe Seite
---PARAGRAPH BREAK---
Art. 13 Abs. 2
---PARAGRAPH BREAK---
call_made
---PARAGRAPH BREAK---
 der Professorenverordnung vorschreibt.
---PARAGRAPH BREAK---
 
---PARAGRAPH BREAK---
 «Der Untersuchungsbericht belegt, dass es sich um inakzeptables Verhalten handelt, das wir nicht tolerieren», sagt ETH-Präsident Lino Guzzella. Er betont aber gleichzeitig,


Cleaned Text:

Bald ist Schluss. Schluss mit der Professur an der ETH, mit Forschungsprojekten und dem Unterricht. Ernst Hafen hat sich gut mit dem Gedanken angefreundet, dass seine aktive Zeit als Professor abgelaufen ist. «Es gibt nichts Unerledigtes», sagt er, am Tisch in seinem Büro sitzend.
---PARAGRAPH BREAK---
Seine Forschungsgruppe hat er aufgelöst. Nur ein Mitarbeiter ist ihm geblieben, und dieser wird sein wohl letztes Forschungsvorhaben weiterführen, bei dem es um die Gesundheit von Honigbienen geht. Seit fünf Jahren hält Hafen mit seiner Frau Bienen auf dem Dach seiner Garage in Zürich – dabei wurde er mit den Problemen der Bienen konfrontiert, allem voran mit den Varroa-Milben, welche den Völkern zusetzen. Dies habe ihn dazu inspiriert und motiviert, dieses Bienen-Projekt zu lancieren.
---PARAGRAPH BREAK---
Sein angestammtes Forschungsfeld verlässt Hafen dafür nicht. Ziel des Projekts ist mit Hilfe der Taufliege Drosophila, dem biologischen Modellsystem, dem Hafen 45 Jahr

The double newlines between header and text are not optimal. Additionally, bullet-point lists are not captured as such. Improvements would be possible, but we decided to first check whether Docling already handles the conversion better with its predefined settings.
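
One possible improvement to the BeautifulSoup approach, sketched here under assumptions, is to collect the text per block-level element instead of calling `get_text(separator='\n\n')` on the whole document. Inline links then stay inside their sentence, and list items can be marked as such. The helper `extract_blocks` and the `BLOCK_TAGS` list are hypothetical and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical set of block-level tags; adjust to the tags present in the data
BLOCK_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6", "p", "li"]


def extract_blocks(html_content):
    """Return one text string per block element, keeping inline links in place."""
    soup = BeautifulSoup(html_content, "html.parser")
    blocks = []
    for el in soup.find_all(BLOCK_TAGS):
        # Join inline text nodes with a space, then normalise whitespace
        text = " ".join(el.get_text(" ", strip=True).split())
        if not text:
            continue
        if el.name == "li":
            text = "- " + text  # mark list items explicitly
        blocks.append(text)
    return blocks


sample = (
    '<div><h2>Topics</h2>'
    '<p>Seminar of the <a href="#">IATUL</a> association.</p>'
    '<ul><li>metrics</li><li>awareness</li></ul></div>'
)
blocks = extract_blocks(sample)
print("\n\n".join(blocks))
```

This sketch has two known limitations. Nested block elements (for example a `<p>` inside an `<li>`) would be emitted twice, and joining inline nodes with a space can leave a stray space before punctuation that directly follows a link.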

Another thing to note is that parts of the texts carry no useful information, such as the "Subscribe to Newsletter" and "Staffnet" sections, or endings such as the following one: "externe Seite 10.1002/smj.2221 call_made"
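
The link decorations could be stripped in a post-processing step. The following is a minimal sketch, assuming that the markers always appear on a line of their own, as they do in the sample output above. The marker list and the helper `strip_link_markers` are assumptions and would need to be checked against more documents.

```python
import re

# Marker strings observed in the sample outputs; the list is an assumption
# and may need extending after inspecting more documents.
LINK_MARKERS = ["external page", "externe Seite", "call_made"]

_marker_line = re.compile(
    r"^[ \t]*(?:" + "|".join(re.escape(m) for m in LINK_MARKERS) + r")[ \t]*$",
    flags=re.MULTILINE,
)


def strip_link_markers(text):
    """Remove lines consisting only of a link marker, then collapse blank lines."""
    text = _marker_line.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


sample = "see\n\nexternal page\n\nthe proton radius puzzle\n\ncall_made\n\nfor details"
cleaned = strip_link_markers(sample)
print(cleaned)
```

Markers that share a line with other text are deliberately left untouched, since removing them there could also delete legitimate uses of the phrase "external page".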

#### Application of cleaning function on full Dataframe

Below we apply the conversion to all documents.

In [None]:
df['bs_html_title'], df['bs_html_content'] = zip(*df['html_content'].apply(clean_html))

In [None]:
pd.set_option('display.max_colwidth', 150)
df[['html_content', 'bs_html_title', 'bs_html_content']].head(5)

Unnamed: 0,html_content,bs_html_title,bs_html_content
0,"<div class=""text-image cq-dd-image"">\n<p>Während andere in den Sechziger-Jahren Lokführer werden wollten, war für Martin Mächler schon immer klar:...",,"Während andere in den Sechziger-Jahren Lokführer werden wollten, war für Martin Mächler schon immer klar: er würde Forscher werden. Das änderte si..."
1,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Ueber-uns/Aktuell/Web-of-Science-Alles-neu-macht-...",,Weitere Informationen\n\n \n\nStaffnet\n\nDas \n\nInfo-Portal\n\n für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der...
2,"<div class=""text-image cq-dd-image"">\n<h2>Staffnet</h2>\n<p>Das <a class=""eth-link"" href=""https://ethz.ch/services/de.html"">Info-Portal</a> für Mi...",,Staffnet\n\nDas \n\nInfo-Portal\n\n für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Zürich.\n\nNewsletter abo...
3,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Dienstleistungen/Schulungen-Tutorials-Fuehrungen/...",,Weitere Informationen\n\n \n\nStaffnet\n\nDas \n\nInfo-Portal\n\n für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der...
4,"<div class=""text-image cq-dd-image"">\n<p>Das Studierendenleben an der ETH dreht sich momentan nur um eines: Die anstehenden Prüfungen. Während and...",,Das Studierendenleben an der ETH dreht sich momentan nur um eines: Die anstehenden Prüfungen. Während andere sich in den Bergen beim Skifahren ver...


#### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-02-bs.csv'), index=False)

### Docling

#### Definition of the Converter

In [None]:
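# Create AcceleratorOptions for CUDA
cuda_accelerator_options = AcceleratorOptions(device=AcceleratorDevice.CUDA)
# Collect the per-format options in a dict so they can be passed to the converter
format_options = {
    InputFormat.HTML: HTMLFormatOption(
        pipeline_options=PipelineOptions(accelerator_options=cuda_accelerator_options)
    )
}

In [None]:
# Initialize the Docling converter with the HTML format options
converter = DocumentConverter(format_options=format_options)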

In [None]:
# Define function for conversion
def html_file_to_markdown(file_path):
    """Convert an HTML file to markdown using Docling"""
    try:
        # Convert the HTML file directly by path
        result = converter.convert(file_path)
        return result.document.export_to_markdown()
    except Exception as e:
        return f"Error converting file {file_path}: {str(e)}"

#### Application of conversion on subset

In [None]:
# Apply the conversion function using the 'full_path' column
df_test['markdown_content'] = df_test['full_path'].apply(html_file_to_markdown)

# View the result
df_test[['html_content', 'title', 'markdown_content']].head()

Unnamed: 0,html_content,title,markdown_content
4040,"<div class=""text-image cq-dd-image"">\n<p>Should he study history or physics? As a teenager in Athens in the 1970s, Konstantinos Boulouchos was int...",,"## What influences thinking\n\nSwitzerland has long since become a second home for the professor, who has held Swiss citizenship since 1997. Where..."
2957,"<div class=""text-image cq-dd-image"">\n<p>The ETH Library is the host and organiser of this year's Fall Seminar of the <a class=""eth-link"" href=""ht...",,"## Always up to date\n\nWould you like to always receive the most important internal information and news from ETH Zurich? Then subscribe to the ""..."
3133,"<div class=""text-image cq-dd-image"">\n<p>A deuteron is a very simple atomic nucleus made up of just one proton and one neutron — that is, one each...",,## New experiment creates excitement\n\nThe new research result is actually more than a doubling of the old mystery of the proton radius: Beyond t...
1462,"<div class=""text-image cq-dd-image"">\n<p>Gestützt auf die Resultate der <a class=""eth-link"" href=""/de/news-und-veranstaltungen/eth-news/news/2017/...",,"## Das Entlassungsverfahren gemäss Professorenverordnung\n\nDer Präsident der ETH setzt vor seinem Antrag an die Wahlbehörde, den ETH-Rat, eine Ko..."
1230,"<div class=""text-image cq-dd-image"">\n<p>Bald ist Schluss. Schluss mit der Professur an der ETH, mit Forschungsprojekten und dem Unterricht. Ernst...",,## Den Vater übertrumpfen\n\nZur Biologie kam Hafen durch seinen Biologielehrer am Gymnasium. «Bei Schweizer Jugend forscht hatte ich nie mitgemac...


In [None]:
from IPython.display import HTML
for idx, row in df_test.iterrows():
    print("-" * 100)
    print(f"Row {idx}:")
    display(HTML(row['html_content']))
    # Display the cleaned text
    print("\nCleaned Text (Markdown):\n")
    print(row['markdown_content'])
    print(100 * "-")
    print("\n")

----------------------------------------------------------------------------------------------------
Row 4040:



Cleaned Text (Markdown):

## What influences thinking

Switzerland has long since become a second home for the professor, who has held Swiss citizenship since 1997. Where does Boulouchos see the biggest differences between Greece and Switzerland? Naturally, in Switzerland everything is more orderly and structured and hence more reliable, he says. But at the end of the day, we are all Greeks, says the professor with a smile. He means this philosophically, like so much of the conversation: “The Greeks Aristotle, Heraclitus and Democritus influenced all our thinking. Logical methodology, the empirical testing of established opinions and doctrines dominate not only, but of course especially, the world of science.” For him, however, it was vital to supplement the thinking of the ancient Greeks with ideas from Eastern philosophy. What particularly fascinates Boulouchos, who loves Japan, is thinking in complex systems: “In Eastern philosophy, humans see themselves as part of a larger system,


Cleaned Text (Markdown):

## Always up to date

Would you like to always receive the most important internal information and news from ETH Zurich? Then subscribe to the "internal news" newsletter and visit Staffnet, the information portal for ETH employees.
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Row 3133:



Cleaned Text (Markdown):

## New experiment creates excitement

The new research result is actually more than a doubling of the old mystery of the proton radius: Beyond that, it can further the search for the true nature of things. “Naturally it can’t be that the deuteron — any more than the proton — has two different sizes,” says Antognini. Thus the scientific community is searching for explanations that can bring the different values back into harmony.

One possible explanation is that a physical force as yet unknown is at work. For the scientists, that is an exciting scenario; it is, however, highly improbable.

A more obvious explanation would be experimental imprecision. “Actually, the mystery could be solved very easily if we assume a minimal experimental problem with the hydrogen spectroscopy,” Antognini explains. Some of the earlier measurements, of both the proton’s size and the deuteron’s size, were based on this method.

Another method for determining the sizes of the proto


Cleaned Text (Markdown):

## Das Entlassungsverfahren gemäss Professorenverordnung

Der Präsident der ETH setzt vor seinem Antrag an die Wahlbehörde, den ETH-Rat, eine Kommission ein, welche über die Angemessenheit der Kündigung befindet und eine Empfehlung abgibt. Diese Empfehlung legt der ETH-Präsident anschliessend zusammen mit seinem eigenen Antrag dem ETH-Rat zum Entscheid vor. Für die Kommission wird die Schulleitung drei Mitglieder selbst bestimmen und zusätzlich die Konferenz des Lehrkörpers auffordern, drei Mitglieder zu bezeichnen. Drei Mitglieder der Kommission müssen Externe sein.

## Kontakt

ETH Zürich
 Medienstelle
 Telefon: +41 44 632 41 41
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Row 1230:



Cleaned Text (Markdown):

## Den Vater übertrumpfen

Zur Biologie kam Hafen durch seinen Biologielehrer am Gymnasium. «Bei Schweizer Jugend forscht hatte ich nie mitgemacht», schmunzelt er. «Ich hatte aber im Gymnasium einen sehr guten Lehrer, der mich für das Fach begeisterte.» Ein zusätzlicher Ansporn sei aber auch gewesen, dass sein Vater, ein Germanist, Deutschlehrer und Rektor am Gymnasium Münchenstein, in Biologie eher schlecht war und der Sohn eine Möglichkeit sah, seinen «Übervater» darin zu übertrumpfen. Also schrieb er für das Studium der Molekular- und Zellbiologie am Biozentrum der Universität Basel ein.

In den Vorlesungen von Walter Gehring, einem bekannten Schweizer Molekular- und Entwicklungsbiologen, stiess Ernst Hafen erstmals auf die Fliege. Gehring habe beschrieben, wie sich in den dotterreichen Insekteneier die Kerne teilen, bis es mehrere tausend davon gibt. Einige davon wandern dann zum hinteren Ende des Eies, wo sie sich in die künftigen Samen- und Eizellen umw

**Docling's conversion appears to drop content: the part of the HTML before the first heading is missing from the Markdown.**

#### Application of Conversion on full Dataframe

Below we apply the conversion to all documents.

In [None]:
df['markdown_content_docling'] = df['full_path'].progress_apply(html_file_to_markdown)

100%|██████████| 4358/4358 [00:51<00:00, 83.86it/s]
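
Because `html_file_to_markdown` returns an error message as a string instead of raising, failed conversions end up in the same column as valid Markdown. The following sketch, using invented demo data, shows one way to count such rows as well as empty results. The prefix matches the message built in the function above.

```python
import pandas as pd

ERROR_PREFIX = "Error converting file"  # prefix used by html_file_to_markdown

# Invented demo data: one valid result, one failed conversion, one empty result
demo = pd.DataFrame({
    "full_path": ["a.html", "b.html", "c.html"],
    "markdown_content_docling": [
        "## Heading\n\nText",
        "Error converting file b.html: demo failure",
        "",
    ],
})

failed = demo["markdown_content_docling"].str.startswith(ERROR_PREFIX)
empty = demo["markdown_content_docling"].str.strip() == ""

print(f"failed: {failed.sum()}, empty: {empty.sum()}")
```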


In [None]:
pd.set_option('display.max_colwidth', 150)
df[['html_content', 'markdown_content_docling']].head()

Unnamed: 0,html_content,markdown_content_docling
0,"<div class=""text-image cq-dd-image"">\n<p>Während andere in den Sechziger-Jahren Lokführer werden wollten, war für Martin Mächler schon immer klar:...",## Etwas Sinnvolles für die Menschen schaffen\n\nMehr als 30 Jahre sind seit Beginn seines Doktorats vergangen. In dieser Zeit hat er es weit gebr...
1,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Ueber-uns/Aktuell/Web-of-Science-Alles-neu-macht-...",## Staffnet\n\nDas Info-Portal für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Zürich.\n\n## Newsletter abonn...
2,"<div class=""text-image cq-dd-image"">\n<h2>Staffnet</h2>\n<p>Das <a class=""eth-link"" href=""https://ethz.ch/services/de.html"">Info-Portal</a> für Mi...",## Staffnet\n\nDas Info-Portal für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Zürich.\n\n## Newsletter abonn...
3,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Dienstleistungen/Schulungen-Tutorials-Fuehrungen/...",## Staffnet\n\nDas Info-Portal für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Zürich.\n\n## Newsletter abonn...
4,"<div class=""text-image cq-dd-image"">\n<p>Das Studierendenleben an der ETH dreht sich momentan nur um eines: Die anstehenden Prüfungen. Während and...","## Zur Person\n\nJulia Wysling\n\n<!-- image -->\n\nIm November 2013 wählte der Mitgliederrat, das oberste Organ des Studierendenverbands VSETH, J..."


#### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-03-docling.csv'), index=False)

In [None]:
# For loading
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-03-docling.csv'))

### Hybrid Approach

#### Definition of the Function for Conversion

In [None]:
def html_file_to_markdown_bs_docling(file_path):
    """Convert an HTML file to markdown, handling content before first header separately"""
    try:
        # Read the original HTML file
        with open(file_path, 'r', encoding='utf-8') as f:
            html_content = f.read()

        # Parse HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Find the first header
        first_header = soup.find(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

        # If no header is found, just use Docling for the whole document
        if not first_header:
            result = converter.convert(file_path)
            return result.document.export_to_markdown()

        # Get the header text to use as a marker
        header_text = first_header.get_text().strip()
        header_level = int(first_header.name[1])
        header_markdown = '#' * header_level + ' ' + header_text

        # Process the entire document with Docling
        result = converter.convert(file_path)
        full_markdown = result.document.export_to_markdown()

        # Find where our header appears in the markdown
        header_index = full_markdown.find(header_markdown)

        # If the header is not found in the markdown, try finding just the header text
        if header_index == -1:
            header_index = full_markdown.find(header_text)
            if header_index == -1:
                # If we still can't find it, return the full markdown
                return full_markdown

        # Extract introduction paragraphs from HTML
        intro_paragraphs = []
        for paragraph in soup.find_all('p'):
            # Only consider paragraphs that appear before the first header
            if (
                hasattr(paragraph, 'sourceline') and
                hasattr(first_header, 'sourceline') and
                paragraph.sourceline < first_header.sourceline
            ):
                text = paragraph.get_text(strip=True)
                if text:
                    intro_paragraphs.append(text)

        # If there are intro paragraphs, combine them
        if intro_paragraphs:
            intro_markdown = "\n\n".join(intro_paragraphs)

            # Check if the intro is already in the markdown before the header
            markdown_before_header = full_markdown[:header_index].strip()

            # If intro is already included (or partially included), use the full markdown
            if intro_markdown in markdown_before_header or any(p in markdown_before_header for p in intro_paragraphs):
                return full_markdown

            # Otherwise, add the intro before the header section
            return intro_markdown + "\n\n" + full_markdown[header_index:]
        else:
            # No intro paragraphs, return the full markdown
            return full_markdown

    except Exception as e:
        return f"Error converting file {file_path}: {str(e)}"
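
The branch logic at the end of `html_file_to_markdown_bs_docling` can be looked at in isolation from file reading and conversion. The following is a minimal sketch on plain strings, not the notebook's implementation: `merge_intro` is a hypothetical helper, and it leaves out the fallback that searches for the bare header text.

```python
def merge_intro(full_markdown, header_markdown, intro_paragraphs):
    """Prepend intro paragraphs that the converter dropped before the first header.

    Hypothetical helper that mirrors the decisions of the hybrid function;
    it works on strings only, so no HTML parsing or Docling call is needed.
    """
    header_index = full_markdown.find(header_markdown)
    if header_index == -1 or not intro_paragraphs:
        # Header not found or nothing to add: keep the converter output.
        return full_markdown

    before_header = full_markdown[:header_index]
    if any(p in before_header for p in intro_paragraphs):
        # The converter already kept at least part of the intro.
        return full_markdown

    intro = "\n\n".join(intro_paragraphs)
    return intro + "\n\n" + full_markdown[header_index:]


docling_output = "## Zur Person\n\nJulia Wysling"
merged = merge_intro(docling_output, "## Zur Person", ["Das Studierendenleben an der ETH."])
print(merged)
```

Note that, as in the function above, any Markdown that the converter placed before the header is discarded once the intro is prepended.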

#### Application of conversion on subset

In [None]:
# Apply the conversion function using the 'full_path' column
df_test['markdown_content_hybrid'] = df_test['full_path'].apply(html_file_to_markdown_bs_docling)

# View the result
df_test[['html_content', 'full_path', 'markdown_content_hybrid']].head()

Unnamed: 0,html_content,full_path,markdown_content_hybrid
4040,"<div class=""text-image cq-dd-image"">\n<p>Should he study history or physics? As a teenager in Athens in the 1970s, Konstantinos Boulouchos was int...",/content/drive/MyDrive/AdvGenAI/data/en_news_events/2021/06/were-all-greeks.html,"Should he study history or physics? As a teenager in Athens in the 1970s, Konstantinos Boulouchos was interested in so many things. But while the ..."
2957,"<div class=""text-image cq-dd-image"">\n<p>The ETH Library is the host and organiser of this year's Fall Seminar of the <a class=""eth-link"" href=""ht...",/content/drive/MyDrive/AdvGenAI/data/en_internal/2022/07/save-the-date-iatul-fall-seminar-2022.html,The ETH Library is the host and organiser of this year's Fall Seminar of theexternal pageInternational Association of University Librariescall_mad...
3133,"<div class=""text-image cq-dd-image"">\n<p>A deuteron is a very simple atomic nucleus made up of just one proton and one neutron — that is, one each...",/content/drive/MyDrive/AdvGenAI/data/en_news_events/2016/08/deuteron-smaller-than-thought.html,"A deuteron is a very simple atomic nucleus made up of just one proton and one neutron — that is, one each of the two nuclear building blocks. An i..."
1462,"<div class=""text-image cq-dd-image"">\n<p>Gestützt auf die Resultate der <a class=""eth-link"" href=""/de/news-und-veranstaltungen/eth-news/news/2017/...",/content/drive/MyDrive/AdvGenAI/data/de_news_events/2018/10/entlassungsverfahren-eingeleitet.html,"Gestützt auf die Resultate derAdministrativuntersuchung, die am 25. Oktober 2017 eingeleitet und im Oktober 2018 abgeschlossen wurde, leitet der P..."
1230,"<div class=""text-image cq-dd-image"">\n<p>Bald ist Schluss. Schluss mit der Professur an der ETH, mit Forschungsprojekten und dem Unterricht. Ernst...",/content/drive/MyDrive/AdvGenAI/data/de_news_events/2021/07/fliegen-daten-und-sieben-velos.html,"Bald ist Schluss. Schluss mit der Professur an der ETH, mit Forschungsprojekten und dem Unterricht. Ernst Hafen hat sich gut mit dem Gedanken ange..."


In [None]:
from IPython.display import HTML
for idx, row in df_test.iterrows():
    print("-" * 100)
    print(f"Row {idx}:")
    display(HTML(row['html_content']))
    # Display the cleaned text
    print("\nCleaned Text (Markdown Hybrid):\n")
    print(row['markdown_content_hybrid'])
    print(100 * "-")
    print("\n")

----------------------------------------------------------------------------------------------------
Row 4040:



Cleaned Text (Markdown Hybrid):

Should he study history or physics? As a teenager in Athens in the 1970s, Konstantinos Boulouchos was interested in so many things. But while the humanities seemed a bit of a dead end, he was told a physics degree was only good for becoming a teacher, an idea he didn’t quite like either. “And today I find nothing more important than educating young people,” laughs Boulouchos.

The ETH professor is full of enthusiasm when he talks about his students. As a young professor, he says, he hadn’t yet recognised the importance of open and good communication. Today, he knows that everyone has their own story. Those who are absorbed by their own issues, for example, might not be able to perform as well. It’s important to talk about things, because this is the only way to provide effective support. He goes on to say that his students shaped him and helped him to develop – and he hopes he did the same for them. Judging by his former students’ numerous awards and c


Cleaned Text (Markdown Hybrid):

The ETH Library is the host and organiser of this year's Fall Seminar of theexternal pageInternational Association of University Librariescall_made(IATUL). For once, the event is also open to non-members of IATUL!Please save the date now: 13–15 December 2022.

Under the title "Breaking new ground: scholarly communication and libraries", speakers from libraries, science and publishing houses will shed light on the following topics over the course of three days:

We are pleased to offer you a perfect background and stage for the IATUL Fall Seminar in the surroundings of ETH Zurich and serve as an inspiring site for all participants.

More details about the IATUL Fall Seminar will be made available on ourwebsitein September 2022. We are looking forward to welcoming you to this inspiring event!

Would you like to become a sponsoring partner for this year’s IATUL Fall Seminar? Please contact us via email at.

Moving forward together –Through the power of ne


Cleaned Text (Markdown Hybrid):

A deuteron is a very simple atomic nucleus made up of just one proton and one neutron — that is, one each of the two nuclear building blocks. An international research collaboration, working at the Paul Scherrer Institute PSI, has measured the deuteron more accurately than ever before. The value they obtained for the radius of the deuteron does not, however, correspond to the measurements of other research groups but instead shows a significantly smaller value.

In spite of this contradiction, there is also an agreement: In 2010 the same research group reported on the measurement of individual protons by means of the same method. Then, as well, the measurement clearly showed that the proton is smaller than had been thought to date. Since then, the research community has referred to this situation as “external pagethe proton radius puzzlecall_made.”  A  further analysis of proton data from PSI confirmed the same small value in 2013.

So now it’s the deu


Cleaned Text (Markdown Hybrid):

Gestützt auf die Resultate derAdministrativuntersuchung, die am 25. Oktober 2017 eingeleitet und im Oktober 2018 abgeschlossen wurde, leitet der Präsident der ETH Zürich das Kündigungsverfahren ein. Die von einem unabhängigen externen Experten durchgeführte Administrativuntersuchung hat schwerwiegendes pflichtwidriges Verhalten über einen längeren Zeitraum hinweg festgestellt. Der Untersuchungsführer empfiehlt eine Auflösung des Arbeitsverhältnisses. Deshalb wird eine Kommission zur Prüfung der Angemessenheit der Kündigung eingesetzt, wie diesexterne SeiteArt. 13 Abs. 2call_madeder Professorenverordnung vorschreibt.«Der Untersuchungsbericht belegt, dass es sich um inakzeptables Verhalten handelt, das wir nicht tolerieren», sagt ETH-Präsident Lino Guzzella. Er betont aber gleichzeitig, dass Verfehlungen die Ausnahme und nicht die Regel sind: «Praktisch jede und jeder der über 500 Professorinnen und Professoren der ETH Zürich leisten Tag für Tag hervorra


Cleaned Text (Markdown Hybrid):

Bald ist Schluss. Schluss mit der Professur an der ETH, mit Forschungsprojekten und dem Unterricht. Ernst Hafen hat sich gut mit dem Gedanken angefreundet, dass seine aktive Zeit als Professor abgelaufen ist. «Es gibt nichts Unerledigtes», sagt er, am Tisch in seinem Büro sitzend.

Seine Forschungsgruppe hat er aufgelöst. Nur ein Mitarbeiter ist ihm geblieben, und dieser wird sein wohl letztes Forschungsvorhaben weiterführen, bei dem es um die Gesundheit von Honigbienen geht. Seit fünf Jahren hält Hafen mit seiner Frau Bienen auf dem Dach seiner Garage in Zürich – dabei wurde er mit den Problemen der Bienen konfrontiert, allem voran mit den Varroa-Milben, welche den Völkern zusetzen. Dies habe ihn dazu inspiriert und motiviert, dieses Bienen-Projekt zu lancieren.

Sein angestammtes Forschungsfeld verlässt Hafen dafür nicht. Ziel des Projekts ist mit Hilfe der Taufliege Drosophila, dem biologischen Modellsystem, dem Hafen 45 Jahre seiner Forschungstätig

#### Application of Conversion on full Dataframe

Below we apply the conversion to all documents.

In [None]:
df['markdown_content_hybrid'] = df['full_path'].progress_apply(html_file_to_markdown_bs_docling)

100%|██████████| 4358/4358 [01:04<00:00, 67.46it/s]


In [None]:
pd.set_option('display.max_colwidth', 150)
df[['html_content', 'markdown_content_hybrid']].head()

Unnamed: 0,html_content,markdown_content_hybrid
0,"<div class=""text-image cq-dd-image"">\n<p>Während andere in den Sechziger-Jahren Lokführer werden wollten, war für Martin Mächler schon immer klar:...","Während andere in den Sechziger-Jahren Lokführer werden wollten, war für Martin Mächler schon immer klar: er würde Forscher werden. Das änderte si..."
1,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Ueber-uns/Aktuell/Web-of-Science-Alles-neu-macht-...",Weitere Informationen\n\n## Staffnet\n\nDas Info-Portal für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Züric...
2,"<div class=""text-image cq-dd-image"">\n<h2>Staffnet</h2>\n<p>Das <a class=""eth-link"" href=""https://ethz.ch/services/de.html"">Info-Portal</a> für Mi...",## Staffnet\n\nDas Info-Portal für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Zürich.\n\n## Newsletter abonn...
3,"<div class=""text-image cq-dd-image"">\n<p><a class=""eth-link"" href=""http://www.library.ethz.ch/de/Dienstleistungen/Schulungen-Tutorials-Fuehrungen/...",Weitere Informationen\n\n## Staffnet\n\nDas Info-Portal für Mitarbeitende mit den wichtigsten Informationen rund um das Geschehen an der ETH Züric...
4,"<div class=""text-image cq-dd-image"">\n<p>Das Studierendenleben an der ETH dreht sich momentan nur um eines: Die anstehenden Prüfungen. Während and...",Das Studierendenleben an der ETH dreht sich momentan nur um eines: Die anstehenden Prüfungen. Während andere sich in den Bergen beim Skifahren ver...


#### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-04-hybrid.csv'), index=False)

# Multilingual Text Preprocessing and Cleaning (5 Points)

In [None]:
# For loading
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-04-hybrid.csv'))

## Preprocessing

Perform necessary text preprocessing (e.g., removing extra spaces and redundant line breaks, normalizing
Unicode characters, standardizing date formats from different sources), and handle German-specific text
processing (e.g., compound words, umlaut normalization if needed).

In [None]:
df["markdown_content_hybrid"].head()

Unnamed: 0,markdown_content_hybrid
0,Während andere in den Sechziger-Jahren Lokführ...
1,Weitere Informationen\n\n## Staffnet\n\nDas In...
2,## Staffnet\n\nDas Info-Portal für Mitarbeiten...
3,Weitere Informationen\n\n## Staffnet\n\nDas In...
4,Das Studierendenleben an der ETH dreht sich mo...


## Metadata

Store the cleaned text and its metadata in a structured format suitable for retrieval (e.g., JSON, CSV, or a
database) with fields such as **language, title, date, source**.

The following step enriches the dataframe with the year and month of each document. Both are taken from the last two components of the folder path, because the files are organized in folders by date.

In [None]:
# Function to extract year and month from the folder path
def extract_year_month(path):
    """Return (year, month) from a folder path ending in .../<year>/<month>."""
    if not isinstance(path, str):  # Handles the case where the input is not a string
        return None, None
    parts = path.split('/')
    if len(parts) < 2:  # The path needs at least two parts
        return None, None
    # The last part is the month, the one before it is the year
    return parts[-2], parts[-1]

# Apply the function to create the new columns 'year' and 'month'
df[['year', 'month']] = df['folder_path'].apply(lambda x: pd.Series(extract_year_month(x)))
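
Because the year and month are read purely by position, any folder that does not follow the date layout would silently produce wrong values. The following is a minimal sketch of a stricter variant, assuming that `folder_path` ends in a four-digit year followed by a month folder, as the paths shown earlier suggest. `extract_year_month_checked` is a hypothetical name, not part of the notebook.

```python
from pathlib import PurePosixPath


def extract_year_month_checked(folder_path):
    """Return (year, month) for a path ending in .../YYYY/MM, else (None, None)."""
    if not isinstance(folder_path, str):
        return None, None
    parts = PurePosixPath(folder_path).parts
    if len(parts) < 2:
        return None, None
    year, month = parts[-2], parts[-1]
    # Accept only a four-digit year and a month between 1 and 12.
    if len(year) == 4 and year.isdecimal() and month.isdecimal() and 1 <= int(month) <= 12:
        return year, month
    return None, None


print(extract_year_month_checked("/content/drive/MyDrive/AdvGenAI/data/en_news_events/2021/06"))
```

Counting the rows where this variant returns `None` is a quick way to spot folders that break the assumed layout.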


The next cell extracts the language (`de` or `en`) and the document type (`internal` or `news events`) from the third-from-last component of the folder path.

In [None]:
def extract_language_type(path):
    """Return (language, type) parsed from the '<language>_<type>' folder in the path."""
    if not isinstance(path, str):
        return None, None
    parts = path.split('/')
    if len(parts) < 4:  # Path too short to contain the language/type folder
        return None, None

    # The third element from the end encodes language and document type
    lang_type_parts = parts[-3].split('_')
    language = lang_type_parts[0] if lang_type_parts[0] in ('de', 'en') else None

    suffix = lang_type_parts[1] if len(lang_type_parts) > 1 else None
    if suffix == 'internal':
        doc_type = 'internal'
    elif suffix == 'news':
        doc_type = 'news events'
    else:
        doc_type = None
    return language, doc_type

# Apply the function to create the new columns 'language' and 'type'
df[['language', 'type']] = df['folder_path'].apply(lambda x: pd.Series(extract_language_type(x)))
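
As an illustration of the folder naming scheme, the same split can be written with `str.partition`. This is a sketch under assumptions rather than a replacement: `split_language_type` is a hypothetical name, the `de_internal` path is made up for the example, and unlike the function above it accepts any suffix as the document type.

```python
from pathlib import PurePosixPath


def split_language_type(folder_path):
    """Split a folder such as 'en_news_events' into ('en', 'news events')."""
    if not isinstance(folder_path, str):
        return None, None
    parts = PurePosixPath(folder_path).parts
    if len(parts) < 3:
        return None, None
    # Everything before the first underscore is the language code,
    # everything after it is the document type.
    prefix, _, suffix = parts[-3].partition("_")
    language = prefix if prefix in ("de", "en") else None
    doc_type = suffix.replace("_", " ") or None
    return language, doc_type


print(split_language_type("/content/drive/MyDrive/AdvGenAI/data/en_news_events/2021/06"))
print(split_language_type("/content/drive/MyDrive/AdvGenAI/data/de_internal/2014/01"))
```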

The next cell derives a title from the HTML file name and adds it to the dataframe. It can serve as a backup title, since many files contain neither a `<meta>` tag with the title nor an `<h1>` tag.

In [None]:
# Function to extract and format the title from the file_name
def extract_and_format_title(file_name):
    """
    Extracts the title from the filename, removes the '.html' extension,
    replaces hyphens with spaces, and capitalizes only the first letter of the first word.

    Args:
        file_name (str): The filename.

    Returns:
        str: The formatted title, or None if the input is not a string.
    """
    if isinstance(file_name, str):
        title = file_name.replace(".html", "").replace("-", " ")
        words = title.split()
        if words:
            words[0] = words[0].capitalize()
            title = " ".join(words)
        return title
    else:
        return None

# Apply the function to create the 'html_title' column
df['html_title'] = df['file_name'].apply(extract_and_format_title)
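
To see what the backup title looks like, the formatting rule can be tried on two file names from the dataset. The snippet below is a compact sketch of the same rule; `title_from_file_name` is a hypothetical name, and it uses `str.removesuffix` (Python 3.9+), which only strips a trailing `.html`.

```python
def title_from_file_name(file_name):
    """Turn 'mehr-orientierung.html' into 'Mehr orientierung'."""
    # Drop the extension, treat hyphens as word separators, and let
    # split() swallow the empty pieces left by double or trailing hyphens.
    words = file_name.removesuffix(".html").replace("-", " ").split()
    if words:
        words[0] = words[0].capitalize()
    return " ".join(words)


for name in ["der-r-pionier1.html", "web-of-science-alles-neu-macht-der--januar-.html"]:
    print(title_from_file_name(name))
```

The result is readable but lossy: umlauts, punctuation, and capitalization of the original headline cannot be recovered from the file name.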

In [None]:
# Print the updated DataFrame
print(df.head())

                                         folder_path  \
0  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
1  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
2  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
3  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
4  /content/drive/MyDrive/AdvGenAI/data/de_intern...   

                                          file_name  \
0                               der-r-pionier1.html   
1  web-of-science-alles-neu-macht-der--januar-.html   
2    swiss-life-sciences-2014-experten-gesucht.html   
3                 mendeley-literaturverwaltung.html   
4                            mehr-orientierung.html   

                                           full_path  \
0  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
1  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
2  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
3  /content/drive/MyDrive/AdvGenAI/data/de_intern...   
4  /content/drive/MyDrive/AdvGenAI/data/de_intern...

In [None]:
print(df.columns)
# Drop the 'bs_html_content' and 'markdown_content_docling' columns (the result is kept in df1)
df1 = df.drop(columns=['bs_html_content', 'markdown_content_docling'])
print(df1.columns)

Index(['folder_path', 'file_name', 'full_path', 'html_content',
       'bs_html_title', 'bs_html_content', 'markdown_content_docling',
       'markdown_content_hybrid', 'year', 'month', 'language', 'type',
       'html_title'],
      dtype='object')
Index(['folder_path', 'file_name', 'full_path', 'html_content',
       'bs_html_title', 'markdown_content_hybrid', 'year', 'month', 'language',
       'type', 'html_title'],
      dtype='object')


In [None]:
# Save the reduced DataFrame (without the dropped columns) as CSV
df1.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-05-metadata.csv'), index=False)

## Metadata Extraction from Content (NLP)

In the following steps, information is extracted from the text and added to the dataframe as metadata, with fields such as **main content, named entities, topics, keywords, summary**.

In [None]:
# Load the step 5 dataset
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-05-metadata.csv'))

This step performs the necessary text preprocessing: removing extra spaces and redundant line breaks, normalizing
Unicode characters, and handling the German-specific character 'ß'.

We decided that there is no need to unify date formats inside the text, as the date is already extracted from the path of the files. We might lose information about the day where a text mentions it, but for a corpus spanning over a decade, that level of granularity is not necessary.

Regarding German-specific processing such as umlauts or compound words:
- After normalizing Unicode characters, umlauts and other special characters are displayed properly, so no further processing is needed.
- Compound words are an essential part of the German language, and splitting them could harm the meaning and readability of the text, as explained by our German-speaking colleague Pascal. We therefore decided against processing them.
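
Unicode allows an umlaut to be stored either as one precomposed code point or as a base letter followed by a combining mark. Both forms render identically but compare as different strings, which matters for search and deduplication. The following is a minimal sketch of composing them with the standard-library `unicodedata` module. The choice of the NFC form is an assumption for illustration.

```python
import unicodedata

decomposed = "Zu\u0308rich"  # 'u' followed by a combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)

# The composed form is one code point shorter.
print(len(decomposed), len(composed))
print(composed == "Z\u00fcrich")

# A UTF-8 encode/decode round trip, by contrast, returns the string unchanged.
print(decomposed.encode("utf-8").decode("utf-8") == decomposed)
```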

In [None]:


import re
import unicodedata

# Print an example of the text in the markdown_content_hybrid column
print(df['markdown_content_hybrid'].iloc[0])

def preprocess_text(text):
    """Collapse whitespace, normalize Unicode to NFC, and replace 'ß' with 'ss'."""
    # Rows without converted content (NaN) are returned unchanged
    if not isinstance(text, str):
        return text

    # Remove extra spaces and redundant line breaks
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()

    # Normalize Unicode characters to their composed form (NFC)
    text = unicodedata.normalize('NFC', text)

    # German-specific processing: replace 'ß' with 'ss' (umlauts and compound words are kept)
    text = text.replace('ß', 'ss')

    return text

# Apply the preprocessing function to the 'markdown_content_hybrid' column
df['content'] = df['markdown_content_hybrid'].progress_apply(preprocess_text)

print(df['content'].iloc[0])

The next step uses the text (`content` column) and its language (`language` column) to perform the following tasks:

1. **Loads NLP Models**: Initializes spaCy models for English and German text processing.
2. **Extracts Features**: Defines a function `extract_all` to extract:
   - Named entities using spaCy.
   - Keywords using KeyBERT.
   - Topics using Gensim's LDA.
   - A summary (heuristic: two longest sentences).
3. **Applies the Function**: Processes each row of the DataFrame using `progress_apply` and adds the extracted features as new columns (`named_entities`, `topics`, `keywords`, `summary`).
4. **Displays Results**: Prints the first few rows of the updated DataFrame.
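
The summary heuristic from step 2 can be sketched without any NLP model. The version below is an approximation under a stated assumption: sentences are split with a naive regular expression instead of spaCy's sentence segmentation. `longest_sentences_summary` is a hypothetical name.

```python
import re


def longest_sentences_summary(text, n=2, min_chars=20):
    """Join the n longest sentences of a text."""
    # Naive split after '.', '!' or '?' followed by whitespace; this breaks
    # on abbreviations and dates, which a real sentence segmenter handles.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)]
    # Ignore very short sentences, then keep the n longest ones.
    sentences = [s for s in sentences if len(s) > min_chars]
    longest = sorted(sentences, key=len, reverse=True)[:n]
    return " ".join(longest)


text = (
    "Bald ist Schluss. "
    "Ernst Hafen hat sich gut mit dem Gedanken angefreundet, dass seine aktive Zeit abgelaufen ist. "
    "Seine Forschungsgruppe hat er aufgelöst."
)
print(longest_sentences_summary(text))
```

The selected sentences are ordered by length, not by their position in the document, so the summary may read out of sequence.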

In [None]:
# --- Install required packages if needed ---
# pip install spacy keybert gensim tqdm
# python -m spacy download en_core_web_sm
# python -m spacy download de_core_news_sm

import pandas as pd
import spacy
from keybert import KeyBERT
from gensim import corpora, models
from tqdm import tqdm

# Load spaCy models
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")

# Init KeyBERT
kw_model = KeyBERT()

# Enable progress_apply
tqdm.pandas()

# --- Define the extraction function ---
def extract_all(text, lang):
    # Select language-specific spaCy model
    if lang == "de":
        nlp = nlp_de
    else:
        nlp = nlp_en

    doc = nlp(text)

    # Named Entities
    entities = list(set((ent.text, ent.label_) for ent in doc.ents))

    # Keywords
    keywords = kw_model.extract_keywords(text, top_n=5)
    keywords = [kw[0] for kw in keywords]

    # Topics using LDA (via Gensim)
    tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    dictionary = corpora.Dictionary([tokens])
    corpus = [dictionary.doc2bow(tokens)]
    try:
        lda_model = models.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=4)
        topics = [word for word, _ in lda_model.show_topic(0)]
    except Exception:  # LDA fails when no usable tokens remain
        topics = []

    # Summary (heuristic: longest 2 sentences)
    sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 20]
    sentences = sorted(sentences, key=lambda s: len(s), reverse=True)
    summary = " ".join(sentences[:2]) if sentences else ""

    return pd.Series({
        "named_entities": entities,
        "topics": topics,
        "keywords": keywords,
        "summary": summary
    })

# --- Apply it to your DataFrame ---
# Make sure your df has 'content' and 'language' columns
df = df.dropna(subset=["content", "language"]).copy()  # copy avoids assigning into a slice
df[["named_entities", "topics", "keywords", "summary"]] = df.progress_apply(
    lambda row: extract_all(row["content"], row["language"]),
    axis=1
)

# Optional: display result
print(df[["content", "language", "named_entities", "topics", "keywords", "summary"]].head())


In [None]:
# save the updated dataframe to a new CSV file
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-06-NLP-processed.csv'), index=False)

In [44]:
# load dataset
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-06-NLP-processed.csv'))
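
Columns such as `named_entities`, `topics`, and `keywords` hold Python lists, and a CSV round trip turns them into strings. The sketch below shows the effect and one way to undo it with `ast.literal_eval`. To stay self-contained it uses the standard-library `csv` module on an in-memory buffer instead of pandas, and the example row is made up for illustration.

```python
import ast
import csv
import io

rows = [{"keywords": ["eth", "library"], "named_entities": [("ETH Zürich", "ORG")]}]

# Write and re-read a CSV in memory; the writer stores the lists as text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["keywords", "named_entities"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
loaded = next(csv.DictReader(buffer))

print(type(loaded["keywords"]))  # a plain string, not a list

# ast.literal_eval parses the Python literal back into a list
# without executing arbitrary code.
keywords = ast.literal_eval(loaded["keywords"])
entities = ast.literal_eval(loaded["named_entities"])
print(keywords, entities)
```

With a DataFrame, the same conversion could be applied per column, for example `df["keywords"].apply(ast.literal_eval)`, provided the column contains no missing values.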

The next cell uses the `sumy` library to create an extractive summary of `content` with Latent Semantic Analysis (LSA). It parses the text with a language-specific tokenizer, applies the `LsaSummarizer` to select the two sentences most relevant to the main topics, and supplies language-specific stop words to the summarizer. The summarization function is then applied to each row of the DataFrame, and the result is stored in the new column `summary_summy`.

In [45]:
import pandas as pd
from sumy.parsers.plaintext import PlaintextParser  # Corrected import
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

import nltk
nltk.download('punkt_tab')

# Function to summarize text based on language
i=0
def summarize_text(row):
    global i
    language = row["language"]
    parser = PlaintextParser.from_string(row["content"], Tokenizer(language))  # Corrected usage
    summarizer = LsaSummarizer()
    summarizer.stop_words = get_stop_words(language)

    summary_sentences = summarizer(parser.document, 2)  # Get top 2 sentences
    i=i+1
    print(i)
    return " ".join([str(sentence) for sentence in summary_sentences])

# Apply summarization
df["summary_summy"] = df.apply(summarize_text, axis=1)

print(df)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!




In [47]:
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-07-summaryv2.csv'), index=False)