# Introduction

The first step in our task is to obtain the data necessary for analysis. Since our company is in the early stages of development and does not have its own database, we intend to use publicly available resources.  
  
The website [Scrape This Site](https://www.scrapethissite.com/pages/forms/) has been recommended to us for this purpose. Before downloading any data, carefully review the [FAQ](https://www.scrapethissite.com/faq/) section of the site, paying particular attention to the limits on the number of requests, which our solution must respect.  
  
It is expected that after executing the code contained in this notebook, the `data/raw/` folder will be populated with data, which will serve as the source for the next stage of the project.

# Notebook Configuration

## Importing Required Libraries

In [1]:
from pathlib import Path

## Driver and Selenium Configuration

In [3]:
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

DRIVER_PATH = Path(r"C:\Users\mjemelka\Desktop\Python\chromedriver-win64\chromedriver-win64\chromedriver.exe")
assert DRIVER_PATH.exists(), f"ChromeDriver not found: {DRIVER_PATH}"

chrome_options = Options()
chrome_options.add_argument("--headless=new")
chrome_options.add_argument("--window-size=1920,1080")

service = Service(str(DRIVER_PATH))
driver = webdriver.Chrome(service=service, options=chrome_options)
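
The hard-coded `DRIVER_PATH` above works on one machine only. Below is a minimal sketch of a more portable setup. It assumes a hypothetical `CHROMEDRIVER_PATH` environment variable (not part of the original notebook) and Selenium 4.6 or newer, where Selenium Manager can locate a driver when no path is given. The helper names `resolve_driver_path` and `build_driver` are illustrative.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_driver_path(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the driver path from CHROMEDRIVER_PATH (hypothetical variable), or None if unset."""
    raw = env.get("CHROMEDRIVER_PATH")
    if not raw:
        return None
    path = Path(raw)
    if not path.exists():
        raise FileNotFoundError(f"ChromeDriver not found: {path}")
    return path


def build_driver():
    """Create a headless Chrome driver with the same options as the cell above."""
    # Selenium is imported lazily so resolve_driver_path can be used without it
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1920,1080")

    driver_path = resolve_driver_path()
    if driver_path is None:
        # Assumption: Selenium >= 4.6, where Selenium Manager resolves the driver itself
        return webdriver.Chrome(options=options)
    return webdriver.Chrome(service=Service(str(driver_path)), options=options)
```

With this variant, `driver = build_driver()` would replace the last two lines of the configuration cell.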


# Fetching Website Content

This section of the notebook contains code for fetching website content. To properly execute the task, consider the following steps:  
- Ensure all available data on the site has been fetched by checking if there are additional data pages.  
- Locate the data of interest on the page using `html` inspection tools.  
- Navigate between subsequent data pages using browser mechanisms or by analyzing the `url` structure.  
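
As an illustration of the `url`-based option in the last step, the sketch below builds the address of a given page. The query parameter name `page_num` is an assumption; confirm it by inspecting the pagination links on the site before relying on it. The helper `page_url` is a hypothetical name.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.scrapethissite.com/pages/forms/"


def page_url(page_number: int, base_url: str = BASE_URL, param: str = "page_num") -> str:
    """Build the URL of one results page (the parameter name is an assumption)."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    return f"{base_url}?{urlencode({param: page_number})}"


print(page_url(3))
```

Navigating by `url` avoids depending on the position or label of the "Next" link, but it requires knowing in advance how many pages exist.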
  
> Remember to respect the query limits specified in the `FAQ`!  
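
One way to keep requests spaced out is a small throttle that enforces a minimum delay between page loads. This is a sketch, not part of the original notebook: the `Throttle` class is hypothetical, and the 2-second interval is an assumed placeholder that should be replaced with the limit stated in the `FAQ`.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval_s: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._last = None    # time of the previous call, None before the first one

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval that has not elapsed yet
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed interval; take the real value from the FAQ
throttle = Throttle(min_interval_s=2.0)
```

Calling `throttle.wait()` immediately before each page load guarantees the spacing even when saving a page takes a variable amount of time.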
  
Save each fetched page to the `data/raw/` folder as `hockey_teams_page_{page_number}.html`. At this stage, we are retrieving data without processing it; analysis will be performed later.  
  
To fetch the `html` content of the page, you can use the `page_source` attribute of the browser object (`driver.page_source` in this notebook). Make sure the browser tool (e.g., Selenium) is configured and ready for use.  
  
> (Optional) If there are multiple pages to fetch, use the [zfill](https://www.programiz.com/python-programming/methods/string/zfill) function to maintain order in file names by adding leading zeros to the page numbers.
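
The effect of zero-padding can be seen in a short sketch. The helper `raw_page_path` and its `width` parameter are hypothetical names introduced for illustration; the folder and file name pattern follow the text above.

```python
from pathlib import Path


def raw_page_path(page_number: int, raw_dir: Path = Path("data/raw"), width: int = 2) -> Path:
    """Return the output path for one page, with the page number zero-padded to `width` digits."""
    return Path(raw_dir) / f"hockey_teams_page_{str(page_number).zfill(width)}.html"


# Without padding, lexicographic sorting puts page 10 before page 2
unpadded = sorted(f"page_{n}" for n in (1, 2, 10))
# With padding, lexicographic order matches numeric order
padded = sorted(raw_page_path(n).name for n in (1, 2, 10))

print(unpadded)
print(padded)
```

A width of 2 is enough for up to 99 pages; a larger site would need a larger `width`.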



In [6]:
from selenium.webdriver.common.by import By
import time

# Entry page of the hockey teams data set (the URL given in the introduction)
START_URL = "https://www.scrapethissite.com/pages/forms/"

raw_dir = Path("data/raw")
raw_dir.mkdir(parents=True, exist_ok=True)

# Load the first page before reading page_source; otherwise an empty start page would be saved
driver.get(START_URL)

page_num = 1
while True:
    # Save the raw HTML of the current page; zfill keeps the files sorted by page number
    out_path = raw_dir / f"hockey_teams_page_{str(page_num).zfill(2)}.html"
    out_path.write_text(driver.page_source, encoding="utf-8")
    print(f"Saved: {out_path}")

    # Look for the pagination link that leads to the next page
    next_buttons = driver.find_elements(By.XPATH, "//a[contains(., 'Next') or contains(., '»')]")

    if not next_buttons:
        print("Next page not found, done.")
        break

    try:
        next_buttons[0].click()
        page_num += 1
        time.sleep(2)  # pause between requests to respect the limits from the FAQ
    except Exception as e:
        print(f"Stopping: the next page could not be opened ({e})")
        break


Saved: data\raw\hockey_teams_page_01.html
Saved: data\raw\hockey_teams_page_02.html
Saved: data\raw\hockey_teams_page_03.html
Saved: data\raw\hockey_teams_page_04.html
Saved: data\raw\hockey_teams_page_05.html
Saved: data\raw\hockey_teams_page_06.html
Saved: data\raw\hockey_teams_page_07.html
Saved: data\raw\hockey_teams_page_08.html
Saved: data\raw\hockey_teams_page_09.html
Saved: data\raw\hockey_teams_page_10.html
Saved: data\raw\hockey_teams_page_11.html
Saved: data\raw\hockey_teams_page_12.html
Saved: data\raw\hockey_teams_page_13.html
Saved: data\raw\hockey_teams_page_14.html
Saved: data\raw\hockey_teams_page_15.html
Saved: data\raw\hockey_teams_page_16.html
Saved: data\raw\hockey_teams_page_17.html
Saved: data\raw\hockey_teams_page_18.html
Saved: data\raw\hockey_teams_page_19.html
Saved: data\raw\hockey_teams_page_20.html
Saved: data\raw\hockey_teams_page_21.html
Saved: data\raw\hockey_teams_page_22.html
Saved: data\raw\hockey_teams_p

# Summary

Downloading raw data from our source has reduced the risk of problems stemming from site updates during the extraction process. This method also offers an additional benefit: it allows easy access to the data in its original form, which is crucial if reprocessing is needed.
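
Before moving on, it may be worth confirming that the raw files form a complete sequence. The sketch below is one possible check, not part of the original workflow; the helper names `saved_page_numbers` and `missing_pages` are hypothetical, and the file name pattern is the one used when saving.

```python
import re
from pathlib import Path

PAGE_FILE = re.compile(r"hockey_teams_page_(\d+)\.html$")


def saved_page_numbers(raw_dir) -> list:
    """Return the sorted page numbers found in the file names of raw_dir."""
    numbers = []
    for path in Path(raw_dir).glob("hockey_teams_page_*.html"):
        match = PAGE_FILE.search(path.name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_pages(numbers) -> list:
    """Return the page numbers between 1 and the highest saved page that have no file."""
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(1, max(numbers) + 1) if n not in present]


pages = saved_page_numbers("data/raw")
print(f"{len(pages)} pages saved, missing: {missing_pages(pages)}")
```

This check only detects gaps below the highest saved page; it cannot tell whether the site had further pages that were never reached.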

In the next step, we will focus on extracting the necessary information from the `html` pages, which is essential for conducting the analysis.