# Automated Web Scraping and Data Processing Workflow

### Imports
1. <u>**Selenium**</u> - A web automation tool that drives web browsers, used for tasks such as automated testing and web scraping.
2. <u>**BeautifulSoup**</u> - A Python library for parsing HTML and XML documents, with simple methods for navigating, searching, and modifying their contents.
3. <u>**JSON**</u> - Python's built-in `json` module, which encodes and decodes JSON data, converting between JSON strings and Python data structures.

In [13]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service as FirefoxService
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import json

### GeckoDriver Setup and WebDriver Initialization
1. Define GeckoDriver Path: Set the path to the GeckoDriver executable (geckodriver.exe), which Selenium needs to automate Firefox.
2. Configure Browser Options: Enable headless mode in Firefox through Selenium options, so the browser runs without a graphical user interface.
3. Configure Firefox Service: Create a Firefox service instance with the specified GeckoDriver path.
4. Initialize WebDriver: Create a new instance of the Firefox WebDriver with the configured options and service.
5. Navigate to URL: Load the target page in the headless Firefox browser.
6. Set Up WebDriverWait: Create a WebDriverWait object with a timeout of 10 seconds, used later to wait for elements to become available on the page before the script continues.

In [14]:
geckodriver_path = r'.\geckodriver.exe'  # raw string so the backslash is not read as an escape

options = Options()
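options.add_argument('-headless')  # run Firefox without a GUI; the options.headless setter is deprecated in recent Selenium 4 releases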

firefox_service = FirefoxService(geckodriver_path)

driver = webdriver.Firefox(options=options, service=firefox_service)

url = 'https://www.noe.gv.at/wasserstand/#/de/Messstellen/Details/207407/DurchflussPrognose/48Stunden'
driver.get(url)

wait = WebDriverWait(driver, 10)  # wait up to 10 seconds for elements to appear
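
WebDriverWait behaves like a polling loop: it re-checks a condition at a fixed interval until the condition holds or the timeout expires. The sketch below illustrates that idea with the standard library only. It is a simplified illustration, not Selenium's implementation; `wait_until`, `table_is_present`, and the `attempts` counter are hypothetical names introduced here, and the 0.5-second default interval mirrors WebDriverWait's default poll frequency.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page element that only appears on the third check.
attempts = {"count": 0}


def table_is_present():
    attempts["count"] += 1
    return "table" if attempts["count"] >= 3 else None


found = wait_until(table_is_present, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

In the notebook itself, `wait.until(...)` plays the role of `wait_until`, and Selenium raises its own `TimeoutException` rather than the built-in `TimeoutError` used in this sketch.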

### Fetching and Parsing Table Data
1. Wait for Table Element: Use WebDriverWait to wait until a table element with the class name 'tabelle' is present on the webpage.
2. Get Inner HTML: Retrieve the inner HTML content of the table element.
3. Parse HTML Content: Use BeautifulSoup to parse the HTML content of the table and create a BeautifulSoup object (soup) for further processing.

In [15]:
table = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tabelle')))

table_html = table.get_attribute('innerHTML')

soup = BeautifulSoup(table_html, 'html.parser')

### Extraction and Structuring of Tabular Data
1. Data Extraction Loop: Iterate through each row (tr) in the parsed HTML content using BeautifulSoup.
2. Column Extraction: For each row, extract all columns (td) within it.
3. Data Validation: Check if the number of columns is 2 before proceeding.
4. Retrieve Text Content: Get the text content of the first and second columns.
5. Data Structuring: Append a dictionary to the 'data' list, where keys are 'column1' and 'column2', and values are the respective text content from the columns.

In [16]:
data = []
for row in soup.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) == 2:  # keep only rows with exactly two cells, the shape of the target table
        column1 = columns[0].get_text(strip=True)
        column2 = columns[1].get_text(strip=True)
        data.append({'column1': column1, 'column2': column2})
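
The validation and structuring steps can also be separated from the HTML parsing, which makes them easy to check without a browser. The following is a minimal sketch under that assumption: `rows_to_records` is a hypothetical helper, and the sample rows are made-up values rather than real scraped data.

```python
def rows_to_records(rows, keys=("column1", "column2")):
    """Turn lists of cell texts into dictionaries, skipping rows of the wrong width."""
    records = []
    for cells in rows:
        if len(cells) != len(keys):
            continue  # same check as `len(columns) == 2` in the loop above
        records.append(dict(zip(keys, (cell.strip() for cell in cells))))
    return records


# Hypothetical cell texts: one row too wide, one valid row, one row too narrow.
sample_rows = [
    ["a", "b", "c"],
    [" 01.01. 12:00 ", "123"],
    ["only one cell"],
]

records = rows_to_records(sample_rows)
print(records)
```

With the parsed table, the input could be built as `[[td.get_text() for td in tr.find_all('td')] for tr in soup.find_all('tr')]`, and passing different `keys` would give the fields more descriptive names than 'column1' and 'column2'.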

### Serialization of Data to JSON File
1. Output File Definition: Specify the name of the output JSON file as 'data.json'.
2. JSON File Creation: Open the output JSON file in write mode ('w') using a context manager (with statement).
3. Data Serialization: Serialize the 'data' list containing dictionaries into JSON format using the json.dump() function. The indent=4 argument ensures that the JSON data is formatted with an indentation of four spaces for readability.
4. File Closure: Automatically close the JSON file after writing the data.
5. Confirmation Message: Print a message indicating that the data has been saved as 'data.json' in the current directory.

In [17]:
output_file = 'data.json'
with open(output_file, 'w') as json_file:
    json.dump(data, json_file, indent=4)

print(f"Data saved as {output_file} in the current directory")

Data saved as data.json in the current directory
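
A quick way to confirm the file was written correctly is to read it back and compare it with the original list. The sketch below does this in a temporary directory with hypothetical sample rows. It also shows an optional variation that the cell above does not use: `ensure_ascii=False` together with `encoding='utf-8'` keeps non-ASCII characters such as German umlauts readable in the file instead of writing them as `\uXXXX` escapes.

```python
import json
import os
import tempfile

# Hypothetical sample rows standing in for the scraped `data` list.
sample = [{"column1": "Höhe", "column2": "1,5 m"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")

    # Write the rows as UTF-8 text without escaping non-ASCII characters.
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(sample, json_file, indent=4, ensure_ascii=False)

    # Read the raw text back and decode it again.
    with open(path, encoding="utf-8") as json_file:
        raw_text = json_file.read()
    restored = json.loads(raw_text)

print(restored == sample)
print("Höhe" in raw_text)
```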


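### Cleanup and Resource Release
1. Quit WebDriver: Call driver.quit() to close the browser and end the GeckoDriver session, releasing the resources they hold.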

In [18]:
driver.quit()
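
If an earlier cell raises an exception, `driver.quit()` is never reached and the headless browser keeps running in the background. One way to guard against that is a small context manager that always quits the driver. This is a sketch rather than part of the original workflow: `managed_driver` and `FakeDriver` are hypothetical names, and the fake driver only exists so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.closed)
```

In the notebook, the factory could be `lambda: webdriver.Firefox(options=options, service=firefox_service)`, with the fetching and parsing steps placed inside the `with` block.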