<a href="https://colab.research.google.com/github/JackGraymer/Advanced-GenAI/blob/main/1_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Generative Artificial Intelligence
**Project - Designing a RAG-Based Q&A System for News Retrieval**

**Authors:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan (Group 5)


# Step 1 - Data preparation

**Contribution:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan

**Goal of this step:** To create a clean and well-structured multilingual dataset of news articles with enriched metadata, optimized for efficient indexing and retrieval in a future RAG-based system.

# Loading, Parsing, and Cleaning HTML Files (5 Points)

## 1. Setup of the environment

Below the necessary libraries are installed and loaded into the environment.

In [None]:
!nvidia-smi 

In [None]:
!pip list

In [None]:
!pip install -q beautifulsoup4==4.13.4
!pip install -q docling==2.31.0
from bs4 import BeautifulSoup, Comment
import docling
from docling.document_converter import DocumentConverter, InputFormat, HTMLFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.pipeline_options import PipelineOptions

In [None]:
import os
import re
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
import tempfile

In [None]:
# Set the seed for consistent results
seed_value = 2138247234
random.seed(seed_value)
np.random.seed(seed_value)
os.environ['PYTHONHASHSEED'] = str(seed_value)

Below we mount a shared Google Drive folder for data storage and define the base path of the folder that will be used during the runtime.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
base_folder = '/content/drive/MyDrive/AdvGenAI'

## 2. Loading the raw data

### Loading

We walk through the subdirectories of the data folder and read each HTML file. The content is stored together with the file name and the file path, so that the subfolder in which each file was located is preserved.
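
The same loading logic can be sketched with `pathlib`. This is a minimal illustration rather than the code used in this notebook; the function name `collect_html_files` and the `errors="replace"` fallback are assumptions.

```python
from pathlib import Path

def collect_html_files(data_folder):
    """Return one record per .html file found below data_folder."""
    records = []
    for path in sorted(Path(data_folder).rglob("*.html")):
        if not path.is_file():
            continue
        # Assumption: undecodable bytes are replaced instead of raising an error
        content = path.read_text(encoding="utf-8", errors="replace")
        records.append({
            "folder_path": str(path.parent),
            "file_name": path.name,
            "full_path": str(path),
            "html_content": content,
        })
    return records
```

The returned list of dictionaries has the same keys as the records collected below, so it could be passed to `pd.DataFrame` in the same way.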

In [None]:
# Definition of data folder
data_folder = os.path.join(base_folder, 'data')

In [None]:
%%time
# List to hold the dictionaries
data = []

# Walk through all directories and subdirectories
for root, dirs, files in os.walk(data_folder):
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(root, file)

            # Read the content of the HTML file
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            # Add a dictionary to the list
            data.append({
                'folder_path': root,
                'file_name': file,
                'full_path': file_path,
                'html_content': content
            })

# Convert to DataFrame
df = pd.DataFrame(data)

# Optionally save DataFrame, e.g. to CSV or pickle for later use
# df.to_csv('html_files_content.csv', index=False)
# df.to_pickle('html_files_content.pkl')

# Show first rows to verify
print(df.head())

In [None]:
pd.set_option('display.max_colwidth', 50)
df.head()

In [None]:
print(df.iloc[10:100, 2].values)

In [None]:
print(df.iloc[-100:-1, 2].values)

### Checking completeness of loading

**Number of files**

Below we compare the number of documents collected by the function into the Dataframe with a selection of all files in the data folder.

The check revealed 3 files that were not part of the dataframe. Inspection showed that these are `.DS_Store` files, which are macOS metadata files, so it is correct that they were not included.
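
The comparison can also be done in pure Python instead of shell commands. The following is a minimal sketch; the helper name `files_missing_from_frame` is hypothetical and not part of the notebook.

```python
import os

def files_missing_from_frame(data_folder, loaded_paths):
    """Return the files on disk that are absent from loaded_paths, sorted."""
    on_disk = {
        os.path.join(root, name)
        for root, _dirs, files in os.walk(data_folder)
        for name in files
    }
    return sorted(on_disk - set(loaded_paths))
```

Calling it as `files_missing_from_frame(data_folder, df['full_path'])` should list the same files as the `comm` command below, assuming the paths in the dataframe were built with `os.path.join` as in the loading step.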

In [None]:
# Dataframe
print(f"Number of files in the DataFrame: {len(df)}")

In [None]:
# Files in Data folder
print("Number of files in the data folder:")
!find "$data_folder" -type f | wc -l

In [None]:
!find "$data_folder" -type f | sort > folder_files.txt
df['full_path'].sort_values().to_csv('df_files.txt', index=False, header=False)
!sort folder_files.txt -o folder_files.txt
!sort df_files.txt -o df_files.txt
!comm -23 folder_files.txt df_files.txt

**Checking for empty files**

Below we print out the rows of the dataframe with empty contents.

In [None]:
df[df['html_content'].isna()].head(10)

In [None]:
print(f"Files with no content: {len(df[df['html_content']==''])}")
df[df['html_content']==""].head(10)

In [None]:
print(df.loc[df['html_content']=="", "full_path"].values)

After checking the files in the original data source, we concluded that these files are already empty at the source, so this is not a problem of the loading process. We therefore exclude those rows from the dataframe.

In [None]:
len(df)

In [None]:
df = df[~df['html_content'].isna()].copy()
df = df[~(df['html_content']=="")].copy()

In [None]:
len(df)

In [None]:
print(df.iloc[:,3].sort_values().values)

In [None]:
print(df.shape)  # Output: (rows, columns)
print(df.columns)  # List all column names

### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-01-raw-data.csv'), index=False)

## 3. Parsing and cleaning the HTML files

### Loading the data from storage

In [None]:
# Load csv from Google Drive Storage to Dataframe
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-01-raw-data.csv'))

In [None]:
pd.set_option('display.max_colwidth', 300)
df[["html_content"]].head(5)

### BeautifulSoup

#### Definition of cleaning function

Below a function is defined to clean the stored strings of the html-files using `BeautifulSoup`. It extracts the title and main texts of the documents while removing various elements that are not of interest for the further analysis (for example style and navigation elements).

In [None]:
def clean_html(html_content):
    from bs4 import BeautifulSoup, Comment
    import re

    soup = BeautifulSoup(html_content, 'html.parser')

    # Title extraction
    title = soup.title.get_text(strip=True) if soup.title else ''

    # Remove unwanted elements
    for el in soup(['script', 'style', 'header', 'footer', 'nav', 'iframe', 'meta', 'link']):
        el.decompose()
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Replace <br> with newline (find_all expects the tag name, not markup)
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Get the content from body if exists
    content = soup.body or soup

    # get_text with separator
    clean_text = content.get_text(separator='\n\n').strip()

    # Post-process: Collapse excessive blank lines
    clean_text = re.sub(r'\n{3,}', '\n\n', clean_text)

    return title, clean_text

#### Application of cleaning function on subset

Below we try out the function on a random subset of five documents. The function separates text blocks with `\n\n`.

In [None]:
# create a subset of the dataframe for testing
df_test = df.sample(n=5).copy()

In [None]:
df_test['title'], df_test['clean_content'] = zip(*df_test['html_content'].apply(clean_html))

In [None]:
pd.set_option('display.max_colwidth', 150)
df_test[['html_content', 'title', 'clean_content']].head()

Below we print out the HTML and the cleaned content for each document for comparison. For the cleaned content the double newlines are replaced by `\n---PARAGRAPH BREAK---\n` for better readability.
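
The effect of the blank-line collapsing and the marker replacement can be illustrated in isolation. This is a small sketch on a toy string; the helper name `mark_paragraph_breaks` is hypothetical.

```python
import re

def mark_paragraph_breaks(text, marker="---PARAGRAPH BREAK---"):
    """Collapse runs of 3+ newlines to a double newline, then make each double newline visible."""
    collapsed = re.sub(r"\n{3,}", "\n\n", text.strip())
    return collapsed.replace("\n\n", f"\n{marker}\n")

print(mark_paragraph_breaks("Title\n\n\n\nFirst paragraph.\n\nSecond paragraph.\n"))
```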

In [None]:
from IPython.display import HTML
for idx, row in df_test.iterrows():
    print("-" * 100)
    print(f"Row {idx}:")
    display(HTML(row['html_content']))
    # Display the cleaned text
    print("\nCleaned Text:\n")
    print(row['clean_content'].replace('\n\n', '\n---PARAGRAPH BREAK---\n'))
    print(100 * "-")
    print("\n")

The double newlines between header and text are not optimal. Additionally, bullet point lists are not captured as lists. Improvements would be possible, but we decided to first check whether Docling handles the conversion better with its predefined settings.

Another thing to note is that parts of the texts do not carry any useful information, such as the "Subscribe to Newsletter" and "Staffnet" sections or endings such as the following one: "externe Seite 10.1002/smj.2221 call_made".
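
One possible way to drop such lines is a small pattern filter. The sketch below is not applied in this notebook. The two patterns are taken from the examples above, and it is an assumption that they appear on lines of their own and generalize to other documents.

```python
import re

# Patterns derived from the examples named above (assumption: one per line)
NOISE_PATTERNS = [
    re.compile(r"^Subscribe to Newsletter$", re.IGNORECASE),
    re.compile(r"^externe Seite .* call_made$"),
]

def drop_noise_lines(text):
    """Remove lines that fully match one of the known boilerplate patterns."""
    kept = [
        line for line in text.split("\n")
        if not any(p.match(line.strip()) for p in NOISE_PATTERNS)
    ]
    return "\n".join(kept)
```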

#### Application of cleaning function on full Dataframe

Below we apply the conversion to all documents.

In [None]:
df['bs_html_title'], df['bs_html_content'] = zip(*df['html_content'].apply(clean_html))

In [None]:
pd.set_option('display.max_colwidth', 150)
df[['html_content', 'bs_html_title', 'bs_html_content']].head(5)

#### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-02-bs.csv'), index=False)

### Docling

#### Definition of the Converter

In [None]:
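# Create AcceleratorOptions for CUDA
cuda_accelerator_options = AcceleratorOptions(device=AcceleratorDevice.CUDA)
# Collect the HTML format options in a dict so they can be passed to the converter
format_options = {
    InputFormat.HTML: HTMLFormatOption(pipeline_options=PipelineOptions(accelerator_options=cuda_accelerator_options))
}

In [None]:
# Initialize the Docling converter with the format options defined above
converter = DocumentConverter(format_options=format_options)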

In [None]:
# Define function for conversion
def html_file_to_markdown(file_path):
    """Convert an HTML file to markdown using Docling"""
    try:
        # Convert the HTML file directly by path
        result = converter.convert(file_path)
        return result.document.export_to_markdown()
    except Exception as e:
        return f"Error converting file {file_path}: {str(e)}"

#### Application of conversion on subset

In [None]:
# Apply the conversion function using the 'full_path' column
df_test['markdown_content'] = df_test['full_path'].apply(html_file_to_markdown)

# View the result
df_test[['html_content', 'title', 'markdown_content']].head()

In [None]:
from IPython.display import HTML
for idx, row in df_test.iterrows():
    print("-" * 100)
    print(f"Row {idx}:")
    display(HTML(row['html_content']))
    # Display the cleaned text
    print("\nCleaned Text (Markdown):\n")
    print(row['markdown_content'])
    print(100 * "-")
    print("\n")

**There seems to be a problem with the conversion of Docling. The part of the HTML before the first heading is not included in the Markdown.**
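
A quick way to check for this problem is to collect the paragraphs that appear before the first heading and test whether they occur in the Markdown. The sketch below uses only the standard library `html.parser`; the class and function names are hypothetical, and the substring comparison is a simplification.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class IntroCollector(HTMLParser):
    """Collect the text of <p> elements that occur before the first heading."""

    def __init__(self):
        super().__init__()
        self.seen_heading = False
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.seen_heading = True
            self.in_paragraph = False
        elif tag == "p" and not self.seen_heading:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data

def intro_paragraphs(html):
    """Return the non-empty paragraph texts found before the first heading."""
    collector = IntroCollector()
    collector.feed(html)
    collector.close()
    return [p.strip() for p in collector.paragraphs if p.strip()]

def missing_intro(html, markdown):
    """Return intro paragraphs of the HTML that do not occur in the Markdown."""
    return [p for p in intro_paragraphs(html) if p not in markdown]
```

A non-empty result of `missing_intro` for a document suggests that the introduction was lost during conversion.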

#### Application of Conversion on full Dataframe

Below we apply the conversion to all documents.

In [None]:
df['markdown_content_docling'] = df['full_path'].progress_apply(html_file_to_markdown)

In [None]:
pd.set_option('display.max_colwidth', 150)
df[['html_content', 'markdown_content_docling']].head()

#### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-03-docling.csv'), index=False)

In [None]:
# For loading
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-03-docling.csv'))

### Hybrid Approach

#### Definition of the Function for Conversion

In [None]:
def html_file_to_markdown_bs_docling(file_path):
    """Convert an HTML file to markdown, handling content before first header separately"""
    try:
        # Read the original HTML file
        with open(file_path, 'r', encoding='utf-8') as f:
            html_content = f.read()

        # Parse HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Find the first header
        first_header = soup.find(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

        # If no header is found, just use Docling for the whole document
        if not first_header:
            result = converter.convert(file_path)
            return result.document.export_to_markdown()

        # Get the header text to use as a marker
        header_text = first_header.get_text().strip()
        header_level = int(first_header.name[1])
        header_markdown = '#' * header_level + ' ' + header_text

        # Process the entire document with Docling
        result = converter.convert(file_path)
        full_markdown = result.document.export_to_markdown()

        # Find where our header appears in the markdown
        header_index = full_markdown.find(header_markdown)

        # If the header is not found in the markdown, try finding just the header text
        if header_index == -1:
            header_index = full_markdown.find(header_text)
            if header_index == -1:
                # If we still can't find it, return the full markdown
                return full_markdown

        # Extract introduction paragraphs from HTML
        intro_paragraphs = []
        for paragraph in soup.find_all('p'):
            # Only consider paragraphs that appear before the first header
            if (
                hasattr(paragraph, 'sourceline') and
                hasattr(first_header, 'sourceline') and
                paragraph.sourceline < first_header.sourceline
            ):
                text = paragraph.get_text(strip=True)
                if text:
                    intro_paragraphs.append(text)

        # If there are intro paragraphs, combine them
        if intro_paragraphs:
            intro_markdown = "\n\n".join(intro_paragraphs)

            # Check if the intro is already in the markdown before the header
            markdown_before_header = full_markdown[:header_index].strip()

            # If intro is already included (or partially included), use the full markdown
            if intro_markdown in markdown_before_header or any(p in markdown_before_header for p in intro_paragraphs):
                return full_markdown

            # Otherwise, add the intro before the header section
            return intro_markdown + "\n\n" + full_markdown[header_index:]
        else:
            # No intro paragraphs, return the full markdown
            return full_markdown

    except Exception as e:
        return f"Error converting file {file_path}: {str(e)}"

#### Application of conversion on subset

In [None]:
# Apply the conversion function using the 'full_path' column
df_test['markdown_content_hybrid'] = df_test['full_path'].apply(html_file_to_markdown_bs_docling)

# View the result
df_test[['html_content', 'full_path', 'markdown_content_hybrid']].head()

In [None]:
from IPython.display import HTML
for idx, row in df_test.iterrows():
    print("-" * 100)
    print(f"Row {idx}:")
    display(HTML(row['html_content']))
    # Display the cleaned text
    print("\nCleaned Text (Markdown Hybrid):\n")
    print(row['markdown_content_hybrid'])
    print(100 * "-")
    print("\n")

#### Application of Conversion on full Dataframe

Below we apply the conversion to all documents.

In [None]:
df['markdown_content_hybrid'] = df['full_path'].progress_apply(html_file_to_markdown_bs_docling)

In [None]:
pd.set_option('display.max_colwidth', 150)
df[['html_content', 'markdown_content_hybrid']].head()

#### Saving the data to storage

In [None]:
# Saving the Dataframe to the Google Drive storage
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-04-hybrid.csv'), index=False)

# Multilingual Text Preprocessing and Cleaning (5 Points)

In [None]:
# For loading
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-04-hybrid.csv'))

## Preprocessing

Perform necessary text preprocessing (e.g., removing extra spaces and redundant line breaks, normalizing
Unicode characters, standardizing date formats from different sources), and handle German-specific text
processing (e.g., compound words, umlaut normalization if needed)

In [None]:
df["markdown_content_hybrid"].head()

## Metadata

Store the cleaned text and its metadata in a structured format suitable for retrieval (e.g., JSON, CSV, or a
database) with fields such as **language, title, date, source**
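
As an illustration of such a structured format, one article could be stored as a single JSON line. This is only a sketch: the helper name `to_json_line` is hypothetical, and all field values are placeholders, not data from this project.

```python
import json

def to_json_line(record):
    """Serialize one article record as a single JSON line, keeping umlauts readable."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)

# Hypothetical record; all values are placeholders for illustration
example = {
    "language": "de",
    "title": "Beispiel Titel",
    "date": {"year": "2020", "month": "05"},
    "source": "example.html",
    "content": "Überblick über das Beispiel.",
}
line = to_json_line(example)
restored = json.loads(line)
```

Setting `ensure_ascii=False` keeps characters such as "Ü" as they are instead of escaping them, which makes the stored German text easier to inspect.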

The following step enriches the dataframe by adding the date (month and year) extracted from the file path, as they are organized by date.

In [None]:
# Function to extract year and month from the folder path
def extract_year_month(path):

    if isinstance(path, str):  # Check if path is a string.
        parts = path.split('/')
        if len(parts) >= 2: # Check if the path has at least two parts
            month = parts[-1]
            year = parts[-2]
            return year, month
        else:
             return None, None
    else:
        return None, None #Handles the case where the input is not a string

# Apply the function to create new columns 'year' and 'month'
df[['year', 'month']] = df['folder_path'].apply(lambda x: pd.Series(extract_year_month(x)))


The next step extracts the document type and the language from the folder path structure.
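
The folder layout used by both extraction steps can be summarized in one sketch. It assumes that every folder path ends in `<language>_<type>/<year>/<month>` with a numeric year and month; the example paths in the test values and the function name `parse_folder_path` are hypothetical.

```python
from pathlib import PurePosixPath

def parse_folder_path(folder_path):
    """Split a path ending in <language>_<type>/<year>/<month> into metadata fields."""
    parts = PurePosixPath(folder_path).parts
    if len(parts) < 3:
        return None
    lang_type, year, month = parts[-3], parts[-2], parts[-1]
    language, _, doc_type = lang_type.partition("_")
    # Assumption: only 'de' and 'en' occur, and year/month folders are numeric
    if language not in ("de", "en") or not (year.isdigit() and month.isdigit()):
        return None
    return {
        "language": language,
        "type": doc_type.replace("_", " ") or None,
        "year": year,
        "month": month,
    }
```

Returning `None` for paths that do not match the expected layout makes such rows easy to find and inspect afterwards.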

In [None]:
def extract_language_type(path):
    if isinstance(path, str):
        parts = path.split('/')
        if len(parts) >= 4:  # Check for the third element from the end
            third_from_end = parts[-3]
            lang_type_parts = third_from_end.split('_')
            language = lang_type_parts[0] if lang_type_parts[0] in ('de', 'en') else None
            Type = 'internal' if len(lang_type_parts) > 1 and lang_type_parts[1] == 'internal' else \
                   'news events' if len(lang_type_parts) > 1 and lang_type_parts[1] == 'news' else None
            return language, Type
        else:
            return None, None
    else:
        return None, None

# Apply the function to create new columns 'language' and 'type'
df[['language', 'type']] = df['folder_path'].apply(lambda x: pd.Series(extract_language_type(x)))

The next step extracts the HTML file name, formats it, and adds it to the dataframe. This could be useful as a backup title, as many files do not contain a `<meta>` tag with the title or a `<h1>` tag for the title.

In [None]:
# Function to extract and format the title from the file_name
def extract_and_format_title(file_name):
    """
    Extracts the title from the filename, removes the '.html' extension,
    replaces hyphens with spaces, and capitalizes only the first letter of the first word.

    Args:
        file_name (str): The filename.

    Returns:
        str: The formatted title, or None if the input is not a string.
    """
    if isinstance(file_name, str):
        title = file_name.replace(".html", "").replace("-", " ")
        words = title.split()
        if words:
            words[0] = words[0].capitalize()
            title = " ".join(words)
        return title
    else:
        return None

# Apply the function to create the 'html_title' column
df['html_title'] = df['file_name'].apply(extract_and_format_title)

In [None]:
# Print the updated DataFrame
print(df.head())

In [None]:
print(df.columns)
# drop 'bs_html_content'and 'markdown_content_docling' columns
df1 = df.drop(columns=['bs_html_content', 'markdown_content_docling'])
print(df1.columns)

In [None]:
# save file as csv
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-05-metadata.csv'), index=False)

## Metadata Extraction from Content (NLP)

In the following steps, information about the text will be extracted and added as metadata to the dataframe, fields such as **main content, named entities, topics, keywords, summary**

In [None]:
# load step 5 dataset
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-05-metadata.csv'))

This step performs the necessary text preprocessing: removing extra spaces and redundant line breaks, normalizing Unicode characters, and handling the German-specific character 'ß'.

We decided that there is no need to process and unify date formats, as the date is already extracted from the path of the files. We might lose some information regarding the day if it is included in some texts, but with a span of over a decade, that level of granularity won't be necessary.

Regarding German-specific processing such as umlauts or compound words:
- After normalizing Unicode characters, umlauts and other characters are displayed properly, so there is no need for further processing.
- Compound words are an essential part of the German language, and splitting them could negatively affect the meaning and readability of the text, as explained by our German-speaking colleague Pascal. Hence we decided against processing them.
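
What Unicode normalization does to umlauts can be shown on a small example. The sketch below assumes the composed form NFC as the normalization target; the function name `normalize_german_text` is hypothetical.

```python
import unicodedata

def normalize_german_text(text, replace_eszett=False):
    """Apply NFC normalization; optionally replace 'ß' with 'ss'."""
    text = unicodedata.normalize("NFC", text)
    if replace_eszett:
        text = text.replace("ß", "ss")
    return text

decomposed = "u\u0308ber"  # 'u' followed by a combining diaeresis
composed = normalize_german_text(decomposed)
print(len(decomposed), len(composed))
```

Both strings are displayed identically, but only after normalization do they compare equal to text that was typed with a precomposed "ü", which matters for exact matching during retrieval.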

In [None]:
import unicodedata

# Print an example of the text in the markdown_content_hybrid column
print(df['markdown_content_hybrid'].iloc[0])

def preprocess_text(text):
    # Remove extra spaces and redundant line breaks
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()

    # Normalize Unicode characters to the composed form (NFC)
    text = unicodedata.normalize('NFC', text)

    # Handle German-specific text processing: replace 'ß' with 'ss'
    text = text.replace('ß', 'ss')

    return text

# Apply the preprocessing function to the 'markdown_content_hybrid' column
df['content'] = df['markdown_content_hybrid'].progress_apply(preprocess_text)

print(df['content'].iloc[0])

The next step processes the text (`content` column) together with its language (`language` column) to perform the following tasks:

1. **Loads NLP Models**: Initializes spaCy models for English and German text processing.
2. **Extracts Features**: Defines a function `extract_all` to extract:
   - Named entities using spaCy.
   - Keywords using KeyBERT.
   - Topics using Gensim's LDA.
   - A summary (heuristic: two longest sentences).
3. **Applies the Function**: Processes each row of the DataFrame using `progress_apply` and adds the extracted features as new columns (`named_entities`, `topics`, `keywords`, `summary`).
4. **Displays Results**: Prints the first few rows of the updated DataFrame.
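
The summary heuristic from the list above can be sketched without any NLP library. In this simplified version the spaCy sentence segmentation is replaced by a regular expression that splits after `.`, `!` or `?`, which is an assumption and less robust than the approach used below.

```python
import re

def longest_sentences_summary(text, n=2, min_chars=20):
    """Join the n longest sentences that have more than min_chars characters."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)]
    candidates = [s for s in sentences if len(s) > min_chars]
    longest = sorted(candidates, key=len, reverse=True)[:n]
    return " ".join(longest)
```

Note that the selected sentences are joined in order of length, not in their original order in the text.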

In [None]:
# --- Install required packages if needed ---
# pip install spacy keybert gensim tqdm
# python -m spacy download en_core_web_sm
# python -m spacy download de_core_news_sm

import pandas as pd
import spacy
from keybert import KeyBERT
from gensim import corpora, models
from tqdm import tqdm

# Load spaCy models
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")

# Init KeyBERT
kw_model = KeyBERT()

# Enable progress_apply
tqdm.pandas()

# --- Define the extraction function ---
def extract_all(text, lang):
    # Select language-specific spaCy model
    if lang == "de":
        nlp = nlp_de
    else:
        nlp = nlp_en

    doc = nlp(text)

    # Named Entities
    entities = list(set((ent.text, ent.label_) for ent in doc.ents))

    # Keywords
    keywords = kw_model.extract_keywords(text, top_n=5)
    keywords = [kw[0] for kw in keywords]

    # Topics using LDA (via Gensim)
    tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    dictionary = corpora.Dictionary([tokens])
    corpus = [dictionary.doc2bow(tokens)]
    try:
        lda_model = models.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=4)
        topics = [word for word, _ in lda_model.show_topic(0)]
    except Exception:
        topics = []

    # Summary (heuristic: longest 2 sentences)
    sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 20]
    sentences = sorted(sentences, key=lambda s: len(s), reverse=True)
    summary = " ".join(sentences[:2]) if sentences else ""

    return pd.Series({
        "named_entities": entities,
        "topics": topics,
        "keywords": keywords,
        "summary": summary
    })

# --- Apply it to your DataFrame ---
# Make sure your df has 'content' and 'language' columns
df = df.dropna(subset=["content", "language"]).copy()
df[["named_entities", "topics", "keywords", "summary"]] = df.progress_apply(
    lambda row: extract_all(row["content"], row["language"]),
    axis=1
)

# Optional: display result
print(df[["content", "language", "named_entities", "topics", "keywords", "summary"]].head())


In [None]:
# save the updated dataframe to a new CSV file
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-06-NLP-processed.csv'), index=False)

In [None]:
# load dataset
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-06-NLP-processed.csv'))

### Evaluation of Summary

This code uses the `sumy` library to extract a summary from the `content` column using Latent Semantic Analysis (LSA). It parses the text content and applies the `LsaSummarizer`, which selects key sentences based on topic relevance while ignoring stop words. The summarization function is then applied to each row in the DataFrame, creating a new column with the summarized text.

In [None]:
import pandas as pd
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

import nltk
nltk.download('punkt_tab')

# Function to summarize text based on language
def summarize_text(row):
    language = row["language"]
    parser = PlaintextParser.from_string(row["content"], Tokenizer(language))
    summarizer = LsaSummarizer()
    summarizer.stop_words = get_stop_words(language)

    summary_sentences = summarizer(parser.document, 2)  # Get top 2 sentences
    return " ".join([str(sentence) for sentence in summary_sentences])

# Apply summarization with a progress bar instead of printing a counter
df["summary_summy"] = df.progress_apply(summarize_text, axis=1)

print(df)


In [None]:
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-07-summaryv2.csv'), index=False)

In [None]:
# load dataset
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-07-summaryv2.csv'))
print(df.columns)

In [None]:
!pip install bert_score

In [None]:
import pandas as pd
from bert_score import score

# Make sure content and summaries are strings
df = df.dropna(subset=['content', 'summary', 'summary_summy'])
contents = df['content'].astype(str).tolist()
summaries_1 = df['summary'].astype(str).tolist()
summaries_2 = df['summary_summy'].astype(str).tolist()

# Compute BERTScore (Precision, Recall, F1) for both sets of summaries
P1, R1, F1 = score(summaries_1, contents, lang="en", verbose=True)
P2, R2, F2 = score(summaries_2, contents, lang="en", verbose=True)

# Add results to DataFrame
df['bertscore_summary_f1'] = F1.tolist()
df['bertscore_summy_f1'] = F2.tolist()

# Compare average F1 scores
mean_f1_summary = F1.mean().item()
mean_f1_summy = F2.mean().item()

print(f"Mean BERTScore F1 - summary:      {mean_f1_summary:.4f}")
print(f"Mean BERTScore F1 - summary_summy: {mean_f1_summy:.4f}")

The comparison of both summaries ends in a near tie:
- Mean BERTScore F1 - summary:      0.8648
- Mean BERTScore F1 - summary_summy: 0.8640
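
Since the means are nearly identical, a per-document comparison can show whether one method wins more often. The following is a minimal sketch; the function name `compare_scores`, the tolerance, and the score values are made up for illustration and are not results from this project.

```python
from statistics import mean

def compare_scores(scores_a, scores_b, tolerance=0.001):
    """Return the mean difference (a - b) and how often each method wins per document."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return {
        "mean_diff": mean(diffs),
        "a_wins": sum(d > tolerance for d in diffs),
        "b_wins": sum(d < -tolerance for d in diffs),
        "ties": sum(abs(d) <= tolerance for d in diffs),
    }

# Made-up scores for illustration only
result = compare_scores([0.87, 0.85, 0.86], [0.86, 0.87, 0.86])
```

With the columns created above, it could be called as `compare_scores(df['bertscore_summary_f1'].tolist(), df['bertscore_summy_f1'].tolist())`.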

In [None]:
# filter the dataframe by 'language' = 'en' and take 10 random samples with a fixed seed
sample = df[df['language'] == 'en'].sample(10, random_state=42)[['summary', 'summary_summy']]

# print the samples in pairs
for i, row in sample.iterrows():
	print(f"Summary: {row['summary']}")
	print(f"Summy: {row['summary_summy']}")
	print()

The heuristic-based approach selects the longest sentences, assuming they are the most relevant. It often fails to capture the essence of the content and produces long chunks without any contextual analysis.

In contrast, sumy uses Latent Semantic Analysis to extract sentences based on topic relevance, which creates more concise and focused summaries. In the sampled documents where the summaries differ, sumy provides the more context-aware and coherent output, because it infers importance rather than relying on sentence length.

While both methods achieve similar BERTScore values, the sumy summaries reflect the content better in our manual comparison.

Hence we will keep only the sumy summary and delete the helper columns from the dataframe.

In [None]:
# Load again the dataset to avoid deleting several columns created in the previous steps
df = pd.read_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-07-summaryv2.csv'))

# Delete 'summary' column
df.drop(columns=['summary'], inplace=True)

# rename column 'summary_summy' to summary
df.rename(columns={'summary_summy': 'summary'}, inplace=True)
print(df.columns)



In [None]:
df.to_csv(os.path.join(base_folder, 'Stage1/Working-dir/Stage1-08-final.csv'), index=False)