# Group Extractor Script Explanation

This notebook explains the `group_extractor.py` script, which extracts group information from an HTML file and saves it to a JSON file. The script parses the HTML with BeautifulSoup, builds one record per link, and writes the result as indented UTF-8 JSON.

## 1. Introduction

The script is designed to:
1. Parse an HTML file to find all links (`<a>` tags).
2. Extract the URL and group name from each link and assign a group ID, reusing the ID from an earlier run when the group is already known.
3. Save the extracted data into a JSON file.

## 2. Import Necessary Libraries

The script imports several libraries:
- `BeautifulSoup` from `bs4` for parsing HTML.
- `json` for handling JSON data.
- `uuid` for generating unique identifiers.
- `os` for interacting with the operating system.
- `config_check_web` and `config_web` for configuration management.
- `open_file` for opening and reading data from JSON files.


## 3. Configuration Decorator

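The `get_group` function is decorated with `@config_check_web(config_web["get_data"])`. The decorator's implementation is not shown here; it likely checks the `get_data` configuration entry before the function body runs. The cell below also shows the start of the function body, which reads the HTML file and parses it.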


In [None]:
@config_check_web(config_web["get_data"])
def get_group(html_file_path, output_json_path):
    # Open the HTML file
    with open(html_file_path, "r", encoding="utf-8") as file:
        html_content = file.read()

    # Parse HTML content
    soup = BeautifulSoup(html_content, "html.parser")
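The implementation of `config_check_web` is not part of this notebook. As a rough illustration of how a decorator factory of this shape can work, the sketch below assumes that the configuration entry is a simple boolean flag and that a disabled function is skipped. Both that behaviour and the names `config_check` and `demo_config` are hypothetical and not taken from the project.

```python
import functools


def config_check(enabled):
    """Hypothetical decorator factory: run the function only if `enabled` is truthy."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                # Assumed behaviour: skip the call when the step is disabled
                print(f"Skipping {func.__name__}: disabled in configuration.")
                return None
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical configuration, mirroring the config_web["get_data"] lookup
demo_config = {"get_data": True, "get_other": False}


@config_check(demo_config["get_data"])
def extract():
    return "extracted"


@config_check(demo_config["get_other"])
def extract_other():
    return "other"


print(extract())        # the function runs because its flag is True
print(extract_other())  # the call is skipped and returns None
```

Because the flag is read when the decorator is applied, changing the configuration afterwards would not affect an already decorated function in this sketch.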


## 4. Parse HTML and Extract Data

The function reads the HTML file and uses BeautifulSoup to parse it. It then finds all `<a>` tags (`links`), initializes an empty list (`data`) for the results, and loads any existing group data from the output JSON file into `old_groups`.
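The loading part of this step can be sketched as follows, assuming the existing file has the `{"groups": [...]}` shape that the script writes and that a missing file means there are no earlier groups. The helper name `load_existing_groups` and the path `groups.json` are hypothetical; the script itself relies on the project's `open_file` utility, whose signature is not shown here. The links are presumably collected with BeautifulSoup's `soup.find_all("a")`.

```python
import json
import os


def load_existing_groups(json_path):
    """Return the list of previously saved groups, or [] if none exist.

    Hypothetical stand-in for the project's `open_file` helper.
    """
    if not os.path.exists(json_path):
        return []
    with open(json_path, "r", encoding="utf-8") as infile:
        content = json.load(infile)
    # Assumes the file has the {"groups": [...]} shape the script writes
    return content.get("groups", [])


data = []  # will hold one record per <a> tag
old_groups = load_existing_groups("groups.json")  # example path only
```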


## 5. Handling Existing Groups

The script creates a lookup dictionary (`old_groups_lookup`) keyed by the `(name, url)` tuple of each existing group, with the group's ID as the value. This ensures that groups with the same name and URL retain their original IDs when the script is run again.


In [None]:
    old_groups_lookup = {
        (group["name"], group["url"]): group["id"] for group in old_groups
    }
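To make the lookup concrete, the following sketch builds it from two sample records and queries it. The records reuse values from the sample output in this notebook; the URL ending in `9999` is made up to show a miss.

```python
old_groups = [
    {
        "id": "819f5200-2c51-4cc8-aac6-a448ad78a6a8",
        "name": "11-5ŘPOS1",
        "url": "https://apl.unob.cz/MojeAP/StudiumSkupina/7253",
    },
    {
        "id": "e6de5760-0b4b-4632-bf6c-d5abf489adf6",
        "name": "11-5ŘPOS2",
        "url": "https://apl.unob.cz/MojeAP/StudiumSkupina/7254",
    },
]

old_groups_lookup = {
    (group["name"], group["url"]): group["id"] for group in old_groups
}

known = old_groups_lookup.get(
    ("11-5ŘPOS1", "https://apl.unob.cz/MojeAP/StudiumSkupina/7253")
)
# Same name with a different (made-up) URL is a different key, so no match
changed = old_groups_lookup.get(
    ("11-5ŘPOS1", "https://apl.unob.cz/MojeAP/StudiumSkupina/9999")
)

print(known)    # 819f5200-2c51-4cc8-aac6-a448ad78a6a8
print(changed)  # None
```

Keying on both fields means that a group whose name or URL changes is treated as new and receives a fresh ID.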


## 6. Extract New Group Data

The script iterates over each link, extracts the URL and group name, and assigns an ID. If a group with the same name and URL already exists, it uses the existing ID. Otherwise, it generates a new unique ID. The extracted data is then appended to the `data` list.


In [None]:
    # Extract URL, group name, and group id from each <a> tag and store in data list
    for link in links:
        url = "https://apl.unob.cz" + link["href"]
        group_name = link.text.strip(";")
        group_id = old_groups_lookup.get((group_name, url), str(uuid.uuid4()))
        data.append(
            {
                "id": group_id,
                "name": group_name,
                "url": url,
                "valid": True,
                "grouptype_id": "cd49e157-610c-11ed-9312-001a7dda7110",
            }
        )

    data = {"groups": data}
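Two details of this loop are easy to miss. `str.strip(";")` removes only leading and trailing semicolons, not whitespace, and the default argument of `dict.get` is evaluated on every call, so a UUID is generated even when it ends up unused. The sketch below illustrates both and shows an equivalent lookup that creates a UUID only on a miss. The helper name `resolve_group_id` and the sample values are hypothetical.

```python
import uuid


def resolve_group_id(lookup, name, url):
    """Reuse the stored ID for (name, url), or create a new UUID4 string."""
    existing = lookup.get((name, url))
    return existing if existing is not None else str(uuid.uuid4())


lookup = {("A", "https://example.org/1"): "fixed-id"}

reused = resolve_group_id(lookup, "A", "https://example.org/1")
fresh = resolve_group_id(lookup, "B", "https://example.org/2")

print(reused)      # fixed-id
print(len(fresh))  # 36, the length of a UUID string

# strip(";") leaves surrounding whitespace untouched
print(repr(" 11-5ŘPOS1;".strip(";")))          # ' 11-5ŘPOS1'
print(repr(" 11-5ŘPOS1;".strip(";").strip()))  # '11-5ŘPOS1'
```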


## 7. Save Extracted Data to JSON

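Finally, the script writes the group data to the output JSON file, overwriting any previous content. It uses a four-space indent for readability and `ensure_ascii=False` so that non-ASCII characters such as `Ř` are written as-is rather than as `\uXXXX` escapes.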


In [None]:
    # Save data to a JSON file
    with open(output_json_path, "w", encoding="utf-8") as outfile:
        json.dump(data, outfile, ensure_ascii=False, indent=4)

    print(f"Data has been saved to {output_json_path} file.")
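The effect of `ensure_ascii=False` is easiest to see on a group name with a non-ASCII character. A small sketch using `json.dumps` on a name from the sample data:

```python
import json

record = {"name": "11-5ŘPOS1"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps Ř as a literal character

print(escaped)   # {"name": "11-5\u0158POS1"}
print(readable)  # {"name": "11-5ŘPOS1"}

# Both forms decode to the same data
assert json.loads(escaped) == json.loads(readable) == record
```

Since non-ASCII characters are written literally, opening the file with `encoding="utf-8"`, as the script does, keeps the output independent of the platform's default encoding.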


The resulting JSON file should look like this:

In [None]:
{
    "groups": [
        {
            "id": "819f5200-2c51-4cc8-aac6-a448ad78a6a8",
            "name": "11-5ŘPOS1",
            "url": "https://apl.unob.cz/MojeAP/StudiumSkupina/7253",
            "valid": true,
            "grouptype_id": "cd49e157-610c-11ed-9312-001a7dda7110"
        },
        {
            "id": "e6de5760-0b4b-4632-bf6c-d5abf489adf6",
            "name": "11-5ŘPOS2",
            "url": "https://apl.unob.cz/MojeAP/StudiumSkupina/7254",
            "valid": true,
            "grouptype_id": "cd49e157-610c-11ed-9312-001a7dda7110"
        },
        {
            "id": "4a2cd23a-6e3a-45ff-9e17-89127143c478",
            "name": "11-5ŘPOS3",
            "url": "https://apl.unob.cz/MojeAP/StudiumSkupina/7255",
            "valid": true,
            "grouptype_id": "cd49e157-610c-11ed-9312-001a7dda7110"
        },
        {
            "id": "8193aa05-2eba-412b-8379-4a83d213e15a",
            "name": "11-5ŘPOS4",
            "url": "https://apl.unob.cz/MojeAP/StudiumSkupina/7256",
            "valid": true,
            "grouptype_id": "cd49e157-610c-11ed-9312-001a7dda7110"
        }
    ]
}
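A saved file can be sanity-checked by loading it back and verifying each record. This sketch assumes that all five keys shown above are required and that every `id` must be a well-formed UUID; the function name `check_groups` is hypothetical and not part of the script.

```python
import uuid

REQUIRED_KEYS = {"id", "name", "url", "valid", "grouptype_id"}


def check_groups(payload):
    """Return a list of problems found in a {"groups": [...]} payload."""
    problems = []
    for index, group in enumerate(payload.get("groups", [])):
        missing = REQUIRED_KEYS - group.keys()
        if missing:
            problems.append(f"group {index}: missing {sorted(missing)}")
            continue
        try:
            uuid.UUID(str(group["id"]))
        except ValueError:
            problems.append(f"group {index}: id is not a valid UUID")
    return problems


sample = {
    "groups": [
        {
            "id": "819f5200-2c51-4cc8-aac6-a448ad78a6a8",
            "name": "11-5ŘPOS1",
            "url": "https://apl.unob.cz/MojeAP/StudiumSkupina/7253",
            "valid": True,
            "grouptype_id": "cd49e157-610c-11ed-9312-001a7dda7110",
        }
    ]
}

print(check_groups(sample))  # []
```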

## 8. Conclusion

In this notebook, we have broken down the `group_extractor.py` script, explaining each part and its purpose. This script is useful for extracting and organizing group information from an HTML file and saving it in a structured JSON format.

Ensure you have the necessary configurations and utility functions (`config_check_web`, `config_web`, `open_file`) available in your project for this script to run correctly.

