# Selenium Web Scraping and Data Extraction Example

This notebook walks through the `get_membership.py` script, which extracts student information from a set of web pages and saves it to a JSON file. The script uses Selenium both to drive the browser and to locate the elements it extracts.

## 1. Introduction

The script is designed to:
1. Log in to a specified website.
2. Extract group and student information from the given URLs.
3. Save the extracted data into a JSON file.

## 2. Import Necessary Libraries

The script imports several libraries:
- `selenium`: Browser automation and element lookup.
- `webdriver_manager.chrome`: Downloads and manages the ChromeDriver binary.
- `json`: Reads and writes JSON data.
- `time`: Adds short delays where needed.
- `uuid`: Generates unique identifiers for new students.
- `os`: Interacts with the operating system.
- `login` from `utils.a_login`: Performs the login.
- Configuration utilities from `utils._config`.


## 3. Create Student List Function

The `create_student_list` function retrieves student data from specified URLs. It navigates to each URL, extracts student names and links, and stores them in a list.
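
To make the data flow concrete, the sketch below shows the shape of one `urls_data` entry and the record produced for each link. The helper `build_student_record` and all sample values are hypothetical; in the script the name and `href` come from Selenium elements rather than plain strings.

```python
def build_student_record(name, href, group):
    """Return one student record in the shape the script appends to `students`."""
    return {
        "name": name.strip(),
        "group": group["name"],
        "group_id": group["id"],
        "valid": True,
        "link": href,
    }


# Hypothetical sample data; the real entries come from utils/b_groups_data.json.
group = {"name": "Group A", "id": "g-001", "url": "https://example.com/groups/g-001"}
links = [("  Jana Nováková ", "https://example.com/students/1")]

students = [build_student_record(name, href, group) for name, href in links]
print(students[0]["name"], students[0]["group_id"])
```

Each record carries its group name and ID, so later steps can match students without revisiting the group page.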

### Explanation of the Code
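- `wait.until(EC.presence_of_all_elements_located(...))`: Waits until at least one element matching the selector is present in the DOM, then returns every match found at that moment.
- Reads each link's `href` and text and appends a record to the `students` list.
- Handles two exceptions: a `TimeoutException` skips the whole group, and a `StaleElementReferenceException` skips only the affected link.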
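
The waiting behaviour can be illustrated without a browser. The following is a simplified stand-in for an explicit wait, not Selenium's implementation: `poll_until` and `WaitTimeout` are hypothetical names, and the function simply retries a callable until it returns a truthy value or the timeout elapses.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (stand-in for TimeoutException)."""


def poll_until(condition, timeout=1.0, interval=0.05):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(interval)


# Simulate a page whose links only appear on the third check.
attempts = {"count": 0}


def find_links():
    attempts["count"] += 1
    return ["link-1", "link-2"] if attempts["count"] >= 3 else []


found = poll_until(find_links)
print(found, attempts["count"])
```

An empty list is falsy, so the loop keeps polling until at least one match exists. A group page with no student links therefore ends in a timeout, which is the case the script handles by skipping the group.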


In [None]:
def create_student_list(urls_data, driver, wait):
    students = []

    for item in urls_data:
        group_name = item["name"]
        group_id = item["id"]
        url = item["url"]
        driver.get(url)
        # time.sleep(0.05)

        # Find all student links
        try:
            links = wait.until(
                EC.presence_of_all_elements_located(
                    (By.CSS_SELECTOR, "#StudijniSkupinaStudents a")
                )
            )
        except TimeoutException:
            print(f"Timeout while waiting for student links in group {group_name}.")
            continue

        print(f" - Opening URL for group {group_name}: {url}\n")

        # Extract each student's name and profile link
        for link in links:
            try:
                student_link = link.get_attribute("href")
                student_name = link.text.strip()

                students.append(
                    {
                        "name": student_name,
                        "group": group_name,
                        "group_id": group_id,
                        "valid": True,
                        "link": student_link,
                    }
                )
            except StaleElementReferenceException as e:
                # Avoid touching `link` here: reading a stale element raises again.
                print(
                    f"Stale element while reading a student link in group {group_name}: {e}"
                )

    return students


## 4. Filter Students Function

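The `filter_students` function checks whether each scraped student already exists in the previously saved data, matching on name and group. If so, it copies the stored ID and details onto the record and skips the profile page. If not, it generates a new unique ID and scrapes the additional details from the student's profile page.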
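
The existence check compares every scraped student against every stored one. One possible alternative, shown here as a sketch rather than what the script does, is to index the stored records by `(name, group)` once and look each student up directly. The function name `index_by_name_and_group` and the sample records are hypothetical.

```python
def index_by_name_and_group(old_students):
    """Map (name, group) to the stored record for constant-time lookups."""
    return {(s["name"], s["group"]): s for s in old_students}


old_students = [
    {"id": "abc", "name": "Jana Nováková", "group": "Group A", "email": "jana@example.com"},
]
scraped = [
    {"name": "Jana Nováková", "group": "Group A"},
    {"name": "Petr Svoboda", "group": "Group A"},
]

known = index_by_name_and_group(old_students)
new_students = []
for student in scraped:
    match = known.get((student["name"], student["group"]))
    if match:
        # Reuse the stored ID and details instead of visiting the profile page.
        student["id"] = match["id"]
        student["email"] = match.get("email", "N/A")
    else:
        new_students.append(student)

print([s["name"] for s in new_students])
```

One difference to keep in mind: if two stored records share the same name and group, the dictionary keeps the last one, whereas a nested loop that breaks on the first hit keeps the first.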

### Explanation of the Code
- Iterates over each student and checks if they already exist.
- Extracts additional student information such as email, year of study, faculty, and data box.
- Handles exceptions such as `TimeoutException`, `NoSuchElementException`, and `StaleElementReferenceException`.
- Merges old and new student data by ID so that each student appears only once in the result.
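
The merge step relies on dictionary unpacking: keys from the right-hand dictionary override those on the left. A minimal sketch with made-up records follows; the helper name `merge_by_id` is hypothetical.

```python
def merge_by_id(old_students, students):
    """Merge two lists of records keyed by "id"; newer values win on conflict."""
    merged = {s["id"]: s for s in old_students}
    for student in students:
        merged[student["id"]] = {**merged.get(student["id"], {}), **student}
    return list(merged.values())


old = [
    {"id": "1", "name": "Jana", "email": "old@example.com", "rocnik": "1"},
    {"id": "2", "name": "Petr", "email": "petr@example.com"},
]
new = [
    {"id": "1", "name": "Jana", "email": "new@example.com"},
    {"id": "3", "name": "Eva", "email": "eva@example.com"},
]

result = merge_by_id(old, new)
print(len(result), result[0]["email"], result[0]["rocnik"])
```

Record `1` takes the new email but keeps its stored `rocnik`. Record `2`, which appears only in the old list, is kept as well, so a student who no longer shows up on any group page stays in the output.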


In [None]:
def filter_students(students, old_students, driver, wait):
    index = 1
    for student in students:
        check = False
        for old_student in old_students:

            if (
                student["name"] == old_student["name"]
                and student["group"] == old_student["group"]
            ):
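                student["id"] = old_student["id"]
                # Use .get() so a record saved after a partial scrape does not raise KeyError
                student["email"] = old_student.get("email", "N/A")
                student["rocnik"] = old_student.get("rocnik", "N/A")
                student["fakulta"] = old_student.get("fakulta", "N/A")
                student["datova_schranka"] = old_student.get("datova_schranka", "N/A")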

                print(
                    f"######{index}. {student['name']} - {student['group']} already exists, skipping."
                )

                check = True
                index += 1
                break

        if check:
            continue

        # Assign the ID before loading the page so the merge step below
        # still finds it if the page load itself times out.
        student["id"] = str(uuid.uuid4())

        try:
            driver.get(student["link"])

            # Email
            email_element = wait.until(
                EC.presence_of_element_located(
                    (By.XPATH, "//div[strong[text()='E-mail:']]/following-sibling::div")
                )
            )
            student["email"] = email_element.text if email_element else "N/A"

            # Year of study
            rocnik_element = wait.until(
                EC.presence_of_element_located(
                    (By.XPATH, "//div[strong[text()='Ročník:']]/following-sibling::div")
                )
            )
            student["rocnik"] = rocnik_element.text if rocnik_element else "N/A"

            # Faculty
            fakulta_element = wait.until(
                EC.presence_of_element_located(
                    (
                        By.XPATH,
                        "//div[strong[text()='Fakulta:']]/following-sibling::div/a",
                    )
                )
            )
            student["fakulta"] = fakulta_element.text if fakulta_element else "N/A"

            # Data box
            datova_schranka_element = wait.until(
                EC.presence_of_element_located(
                    (
                        By.XPATH,
                        "//div[strong[text()='Datová schránka:']]/following-sibling::div",
                    )
                )
            )
            student["datova_schranka"] = (
                datova_schranka_element.text if datova_schranka_element else "N/A"
            )

            # Display on terminal
            print(f"{index}. {student['name']} - {student['group']} has been loaded.")
            index += 1

        # Handle errors
        except (
            TimeoutException,
            NoSuchElementException,
            StaleElementReferenceException,
        ) as e:
            print(f"Error occurred for student {student['name']}: {e}")

    # Merge old and new students to filter out duplicates and update additional data
    student_dict = {}

    for student in old_students:
        student_dict[student["id"]] = student

    for student in students:
        # Fields from the newly scraped record override the stored ones
        student_dict[student["id"]] = {**student_dict.get(student["id"], {}), **student}

    filtered_list = list(student_dict.values())

    return filtered_list


## 5. Main Function to Get Student Data

The `get_student` function sets up the WebDriver, logs into the website, loads group data, and processes the student data. Finally, it saves the extracted data into a JSON file.
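
`open_file` lives in `utils.extract_data` and is not shown in this notebook. Judging only from how it is called, it appears to read a JSON file and return the list stored under a given key. The sketch below is an assumption about that behaviour, using the hypothetical names `load_records` and `save_records`, paired with the same `json.dump` settings the script uses.

```python
import json
import os
import tempfile


def load_records(path, key):
    """Return the list stored under `key`, or an empty list if the file is missing."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as infile:
        return json.load(infile).get(key, [])


def save_records(path, key, records):
    """Write `records` under `key`, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump({key: records}, outfile, ensure_ascii=False, indent=4)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "students.json")
    first_run = load_records(path, "users")  # file does not exist yet
    save_records(path, "users", [{"id": "1", "name": "Jana Nováková"}])
    second_run = load_records(path, "users")
    with open(path, encoding="utf-8") as infile:
        raw_text = infile.read()

print(first_run, second_run)
```

With `ensure_ascii=False`, names such as "Nováková" are written as-is instead of as `\u` escape sequences, which keeps the file readable for Czech data.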

### Explanation of the Code
- Sets up Chrome options and initializes the WebDriver.
- Logs into the website using the `login` function.
- Loads group data from a JSON file.
- Extracts student data using the `create_student_list` function.
- Filters and updates student data using the `filter_students` function.
- Saves the final data to a JSON file and closes the WebDriver.
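
Because the driver is closed only on the last line of the function, an exception raised earlier would leave the browser running. One common way to guard against that, shown as a sketch and not as what the script does, is `try`/`finally`. `FakeDriver` and `run_scrape` are hypothetical stand-ins so the example runs without a browser.

```python
class FakeDriver:
    """Stand-in for a WebDriver that only records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def run_scrape(driver, work):
    """Run `work(driver)` and always close the driver afterwards."""
    try:
        return work(driver)
    finally:
        driver.quit()


def failing_work(driver):
    raise RuntimeError("simulated scraping failure")


driver = FakeDriver()
try:
    run_scrape(driver, failing_work)
except RuntimeError as error:
    print(f"Scrape failed: {error}")

print(driver.closed)
```

The `finally` block runs whether `work` returns normally or raises, so the browser is closed in both cases.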


In [None]:
@config_check_web(config_web["get_data"])
def get_student():
    # Setup Chrome options
    options = Options()
    options.add_experimental_option("detach", True)

    # Start a new instance of Chrome WebDriver
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    wait = WebDriverWait(driver, 10)
    print("WebDriver has been started.")

    # Login to the website
    login(driver)

    # Load group data from the JSON file
    from utils.extract_data import open_file

    output_json_path = "utils/b_groups_data.json"
    urls_data = open_file(output_json_path, "groups")

    print(f"{len(urls_data)} groups have been found.\n")

    # Get student data from the URL
    students = create_student_list(urls_data, driver, wait)

    print(f"Total {len(students)} students have been found.\n")

    print("-----------------------------------------------------------")
    print("Data comparison...\n")

    student_path = "utils/c_students_data.json"
    old_students = open_file(student_path, "users")

    print("All students have been loaded.\n")
    time.sleep(0.5)
    print("Removing unnecessary data...\n")

    # Reuse stored details for known students and scrape details for new ones
    filtered_list = filter_students(students, old_students, driver, wait)

    students_list = {"users": filtered_list}
    time.sleep(0.5)

    print(f"Saving data to {student_path}...\n")

    with open(student_path, "w", encoding="utf-8") as outfile:
        json.dump(students_list, outfile, ensure_ascii=False, indent=4)

    time.sleep(0.5)

    print(f"Data has been saved to {student_path}.\n")

    driver.quit()
    print("WebDriver has been closed.")


## 6. Conclusion

In this notebook, we have broken down the `get_membership.py` script, explaining each part and its purpose. This script is useful for extracting and organizing student information from a set of web pages and saving it in a structured JSON format.




