<a href="https://colab.research.google.com/github/JackGraymer/Advanced-GenAI/blob/main/1_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Generative Artificial Intelligence
**Project - Designing a RAG-Based Q&A System for News Retrieval**

**Authors:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan (Group 5)


# Step 1 - Data preparation

**Contribution:** ....

**Goal of this step:** ....

## 1. Setup of the environment

The cells below install the pinned dependencies and import the libraries used in this step.

In [6]:
!pip install -q beautifulsoup4==4.13.4
!pip install -q docling==2.31.0

In [16]:
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, Comment
import docling
import matplotlib.pyplot as plt

Below we mount a shared Google Drive folder as data storage and define the base path used throughout the runtime.

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
base_folder = '/content/drive/MyDrive/AdvGenAI'

## 2. Loading the raw data

We walk through the subdirectories of the data folder and read every HTML file. For each file we store its content together with its file name and path, so that the subfolder it came from remains known.

In [12]:
# Definition of data folder
data_folder = os.path.join(base_folder, 'data')

In [None]:
# List to hold the dictionaries
data = []

# Walk through all directories and subdirectories
for root, dirs, files in os.walk(data_folder):
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(root, file)

            # Read the content of the HTML file
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            # Add a dictionary to the list
            data.append({
                'folder_path': root,
                'file_name': file,
                'full_path': file_path,
                'html_content': content
            })

# Convert to DataFrame
df = pd.DataFrame(data)

# Optionally save DataFrame, e.g. to CSV or pickle for later use
# df.to_csv('html_files_content.csv', index=False)
# df.to_pickle('html_files_content.pkl')

# Show first rows to verify
print(df.head())

In [18]:
df.head()

|   | folder_path | file_name | full_path | html_content |
|---|---|---|---|---|
| 0 | /content/drive/MyDrive/AdvGenAI/data/de_intern... | der-r-pionier1.html | /content/drive/MyDrive/AdvGenAI/data/de_intern... | `<div class="text-image cq-dd-image">\n<p>Währe...` |
| 1 | /content/drive/MyDrive/AdvGenAI/data/de_intern... | web-of-science-alles-neu-macht-der--januar-.html | /content/drive/MyDrive/AdvGenAI/data/de_intern... | `<div class="text-image cq-dd-image">\n<p><a cl...` |
| 2 | /content/drive/MyDrive/AdvGenAI/data/de_intern... | swiss-life-sciences-2014-experten-gesucht.html | /content/drive/MyDrive/AdvGenAI/data/de_intern... | `<div class="text-image cq-dd-image">\n<h2>Staf...` |
| 3 | /content/drive/MyDrive/AdvGenAI/data/de_intern... | mendeley-literaturverwaltung.html | /content/drive/MyDrive/AdvGenAI/data/de_intern... | `<div class="text-image cq-dd-image">\n<p><a cl...` |
| 4 | /content/drive/MyDrive/AdvGenAI/data/de_intern... | mehr-orientierung.html | /content/drive/MyDrive/AdvGenAI/data/de_intern... | `<div class="text-image cq-dd-image">\n<p>Das S...` |
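
Because `folder_path` records where each file was found, the subfolder names can later be recovered from it. The following is a minimal sketch of one way to do that with the standard library. The helper name `split_folder_path` is hypothetical and not part of the notebook, and the example path is taken from the folder layout visible in this notebook's outputs.

```python
from pathlib import PurePosixPath


def split_folder_path(folder_path, data_root):
    """Return the path components of folder_path that lie below data_root.

    Hypothetical helper for illustration; raises ValueError if folder_path
    is not located inside data_root.
    """
    relative = PurePosixPath(folder_path).relative_to(data_root)
    return relative.parts


# Example with a folder that appears in this notebook's outputs
parts = split_folder_path(
    '/content/drive/MyDrive/AdvGenAI/data/de_internal/2013',
    '/content/drive/MyDrive/AdvGenAI/data',
)
print(parts)  # ('de_internal', '2013')
```

Applied to the DataFrame, this could look like `df['folder_path'].apply(split_folder_path, data_root=data_folder)`, assuming every row lies inside `data_folder`.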


Below we compare the number of documents collected in the DataFrame with the number of files in the data folder.

The check revealed 3 files that were not part of the DataFrame. On inspection these turned out to be `.DS_Store` files (macOS folder metadata), so it is correct that they were excluded.

In [19]:
# Dataframe
print(f"Number of files in the DataFrame: {len(df)}")

Number of files in the DataFrame: 4390


In [20]:
# Files in Data folder
print("Number of files in the data folder:")
!find "$data_folder" -type f | wc -l

Number of files in the data folder:
4393


In [25]:
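# List every file on disk, sorted as required by `comm`
!find "$data_folder" -type f | sort > folder_files.txt
# Write the paths stored in the DataFrame, one per line
df['full_path'].to_csv('df_files.txt', index=False, header=False)
# Sort with the shell so both files use the same collation order
!sort df_files.txt -o df_files.txt
# Print paths that exist on disk but are missing from the DataFrame
!comm -23 folder_files.txt df_files.txt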

/content/drive/MyDrive/AdvGenAI/data/de_internal/2013/.DS_Store
/content/drive/MyDrive/AdvGenAI/data/de_internal/2015/.DS_Store
/content/drive/MyDrive/AdvGenAI/data/de_internal/2024/.DS_Store
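
The same comparison can also be done without shell commands, which may be convenient outside of Colab. The sketch below is one possible approach using a set difference. The function name `files_missing_from` is an assumption made for illustration, and the demonstration runs on a temporary folder rather than on the project data.

```python
import os
import tempfile


def files_missing_from(listed_paths, folder):
    """Return the sorted paths of files under folder that are not in listed_paths."""
    on_disk = {
        os.path.join(root, name)
        for root, _dirs, files in os.walk(folder)
        for name in files
    }
    return sorted(on_disk - set(listed_paths))


# Self-contained demonstration on a temporary folder (not the project data)
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.html', '.DS_Store'):
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write('')
    listed = [os.path.join(tmp, 'a.html')]
    missing = files_missing_from(listed, tmp)
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)  # ['.DS_Store']
```

In the notebook, the equivalent call would be `files_missing_from(df['full_path'], data_folder)`.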


## 3. Parsing and cleaning the HTML files

In [None]:
def clean_html(html_content):
    """
    Extracts the title and main content from HTML while removing unnecessary elements.

    Args:
        html_content (str): The HTML content to clean.

    Returns:
        tuple: (title, cleaned_content) where:
            - title (str): The title of the HTML document
            - cleaned_content (str): The main content of the HTML document with tags removed
    """
    # Parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract the title
    title = ""
    if soup.title:
        title = soup.title.get_text(strip=True)

    # Create a copy of the soup for content extraction
    content_soup = BeautifulSoup(str(soup), 'html.parser')

    # Remove unwanted elements
    for element in content_soup(['script', 'style', 'header', 'footer', 'nav', 'iframe', 'meta', 'link']):
        element.decompose()

    # Remove comments
    for comment in content_soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Get the body content or full document if no body
    main_content = content_soup.body or content_soup

    # Get clean text with spacing between elements
    clean_text = main_content.get_text(separator=' ', strip=True)

    return title, clean_text

# Example usage with a pandas DataFrame:
# df['title'], df['clean_content'] = zip(*df['html_content'].apply(clean_html))
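
To see what the cleaning step returns, it can help to run it on a small hand-written input. The sketch below uses `clean_html_sketch`, a condensed stand-in for the function above that is defined only to keep the example self-contained; the sample HTML is invented for illustration. It also shows that an HTML fragment without a `<title>` element, like the `<div>` fragments in `df.head()`, yields an empty title.

```python
from bs4 import BeautifulSoup, Comment


def clean_html_sketch(html_content):
    """Condensed stand-in for clean_html: returns (title, cleaned_text)."""
    soup = BeautifulSoup(html_content, 'html.parser')
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Drop elements that carry no article text
    for element in soup(['script', 'style', 'header', 'footer', 'nav', 'iframe', 'meta', 'link']):
        element.decompose()
    # Drop HTML comments
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Fragments have no <body>, so fall back to the whole document
    main_content = soup.body or soup
    return title, main_content.get_text(separator=' ', strip=True)


# Invented sample fragment without a <title> element
fragment = '<div class="text-image"><!-- note --><p>Hello <b>world</b></p><script>x=1</script></div>'
title, text = clean_html_sketch(fragment)
print(repr(title))  # ''
print(text)         # Hello world
```

If titles are needed for fragment-only files, they would have to come from another source, for example the file name; that choice is left open here.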